org.elasticsearch.xpack.slm.SnapshotLifecycleIT Fails due to Rate Limiting #46205

Closed
original-brownbear opened this issue Aug 31, 2019 · 6 comments
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

There are numerous failures of org.elasticsearch.xpack.slm.SnapshotLifecycleIT at the moment. The reason is that these tests use rate limiting with very low rate limits on the snapshot repository to simulate snapshot aborts and other concurrent scenarios.

Example Failure -> https://gradle-enterprise.elastic.co/s/sscyvnvkf23gy/console-log


Suite: Test class org.elasticsearch.xpack.slm.SnapshotLifecycleIT
--
1> [2019-08-31T12:22:27,369][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] before test
1> [2019-08-31T12:22:27,385][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] initializing REST clients against [http://[::1]:32827, http://127.0.0.1:33495, http://[::1]:43357, http://127.0.0.1:37929, http://[::1]:34097, http://127.0.0.1:43543, http://[::1]:36011, http://127.0.0.1:36555]
1> [2019-08-31T12:22:31,480][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] after test
1> [2019-08-31T12:22:31,539][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyManualExecution] before test
1> [2019-08-31T12:22:35,781][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyManualExecution] after test
1> [2019-08-31T12:22:35,824][INFO ][o.e.x.s.SnapshotLifecycleIT] [testSnapshotInProgress] before test
1> [2019-08-31T12:22:38,715][INFO ][o.e.x.s.SnapshotLifecycleIT] [testSnapshotInProgress] after test
1> [2019-08-31T12:22:38,769][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] before test
1> [2019-08-31T12:24:40,221][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] There are still tasks running after this test that might break subsequent tests [cluster:admin/repository/put].
1> [2019-08-31T12:24:40,222][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] after test
2> REPRODUCE WITH: ./gradlew :x-pack:plugin:ilm:qa:multi-node:integTestRunner --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleIT.testFullPolicySnapshot" -Dtests.seed=4A8659A9FC9A2C94 -Dtests.security.manager=true -Dtests.locale=fy -Dtests.timezone=Atlantic/Azores -Dcompiler.java=12 -Druntime.java=11
2> java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-40 [ACTIVE]
at __randomizedtesting.SeedInfo.seed([4A8659A9FC9A2C94:435FF4A6DC4174C1]:0)
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:778)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:205)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.inializeRepo(SnapshotLifecycleIT.java:379)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.inializeRepo(SnapshotLifecycleIT.java:364)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.testFullPolicySnapshot(SnapshotLifecycleIT.java:85)
 
Caused by:
java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-40 [ACTIVE]

The problem with this approach is that low rate limits can lead to extremely long sleep times in the rate limiter. In one spot we limit to 1b/s but read 8k in one go, so we get minutes of sleeping. These tests passed more often before #42791 and #45689, but those PRs changed timings in a way that made this trigger more often (I think this is because we now write data in the first step of snapshotting and thus simply build up long waits before the concurrent action is tested; before, we had quite a bit of delay from first writing the snapshot metadata).
I tried improving this situation by using only a single snapshot thread in #46195, but evidently it wasn't enough.
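
To put rough numbers on the sleep times, here is a back-of-the-envelope sketch. It assumes the limiter simply pauses for `bytes / rate` seconds after a read, which is a simplification; the real rate limiter's bookkeeping may differ, and the class and method names are made up for illustration.

```java
import java.time.Duration;

public class RateLimitPause {

    /** Pause needed to "pay" for bytesRead at the given rate, assuming pause = bytes / rate. */
    public static Duration pauseFor(long bytesRead, long bytesPerSecond) {
        // Work in milliseconds so sub-second pauses are not rounded away.
        return Duration.ofMillis(bytesRead * 1000L / bytesPerSecond);
    }

    public static void main(String[] args) {
        // An 8k read against a 1 byte/second limit (the case described above).
        Duration pause = pauseFor(8 * 1024, 1);
        System.out.println(pause.toMinutes() + " minutes"); // prints "136 minutes"
    }
}
```

Under this simplified model a single 8k read already costs far more than the 30-second client timeout seen in the stack trace.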

I'll see what better solution to these tests I can find here.

@original-brownbear original-brownbear added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Aug 31, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear original-brownbear added the >test-failure Triaged test failures from CI label Aug 31, 2019
@original-brownbear original-brownbear self-assigned this Aug 31, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 31, 2019
original-brownbear added a commit that referenced this issue Aug 31, 2019
jkakavas pushed a commit to jkakavas/elasticsearch that referenced this issue Sep 2, 2019
@original-brownbear
Member Author

original-brownbear commented Sep 2, 2019

@dakrone is there a specific reason these tests have to be REST tests in the first place? Can we maybe just make them ESIntegTestCase tests and use MockRepository like we do to test similar things in ES core?
Porting the tests to that would be easier and give us more stable tests than trying to fix the abort logic (which might be worthwhile in and of itself but def. isn't so trivial for now).

@dakrone
Member

dakrone commented Sep 3, 2019

@original-brownbear the reason they're REST tests is the general recommendation to move away from transport-client based tests and prefer REST-based tests. I haven't used MockRepository before, but it appears that it would support what we need with its block*/unblock methods.

If we're okay with adding a new feature relying on ESIntegTestCase, then we can move the tests to be transport-based.
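
For readers unfamiliar with the block/unblock idea, the following is a plain-Java sketch of it. It is not MockRepository's API: `BlockingStore` and its methods are hypothetical stand-ins that use latches to hold a write "in progress" deterministically, instead of relying on a slow rate limit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BlockingStoreSketch {

    /** Hypothetical stand-in for a repository whose writes can be parked on demand. */
    static class BlockingStore {
        private final CountDownLatch reachedBlock = new CountDownLatch(1);
        private final CountDownLatch release = new CountDownLatch(1);
        private volatile boolean written = false;

        void write(byte[] data) throws InterruptedException {
            reachedBlock.countDown(); // tell the test the write is now "in progress"
            release.await();          // park until the test unblocks us
            written = true;
        }

        boolean awaitBlocked(long timeout, TimeUnit unit) throws InterruptedException {
            return reachedBlock.await(timeout, unit);
        }

        void unblock() {
            release.countDown();
        }

        boolean isWritten() {
            return written;
        }
    }

    /** Returns true if the write was observed in progress and completed after unblocking. */
    public static boolean run() {
        BlockingStore store = new BlockingStore();
        Thread snapshot = new Thread(() -> {
            try {
                store.write(new byte[8 * 1024]);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        snapshot.start();
        try {
            boolean inProgress = store.awaitBlocked(10, TimeUnit.SECONDS) && !store.isWritten();
            // A real test would exercise the concurrent action here, while the write is parked.
            store.unblock();
            snapshot.join(10_000);
            return inProgress && store.isWritten();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the design is that the "in progress" window is opened and closed by the test itself, so its length does not depend on timing.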

@original-brownbear
Member Author

@dakrone I think it's fine to depend on ESIntegTestCase here. You can still get coverage of the REST layer from other tests, but setting up the blocking safely in these YAML tests is hard.

Assuming we go ahead with this, want me to port these tests or do you want to do it? :)

@dakrone
Member

dakrone commented Sep 3, 2019

@original-brownbear not sure I follow where you are getting YAML from? None of these are YAML tests?

> Assuming we go ahead with this, want me to port these tests or do you want to do it? :)

I think this should wait until the slm-retention branch has been merged to master (which I can work on shortly now that we have a 7.5 branch), otherwise the merge conflicts are going to be hideous and the tests will have to be rewritten twice. Once it's merged then we can rewrite the tests (either me or you)

@original-brownbear
Member Author

original-brownbear commented Sep 3, 2019

@dakrone sorry, I meant REST instead of YAML. Getting up at 4am didn't do me much good today it seems :) But yea, I'm fine waiting here. I'll assign you then, since you'll be merging the slm-retention branch as well and can time things accordingly.

Thanks!

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 4, 2019
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to elastic#43663
Resolves elastic#46205
dakrone added a commit that referenced this issue Sep 5, 2019
* Rewrite SnapshotLifecycleIT as an ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown
kaypeter87 added a commit to kaypeter87/elasticsearch that referenced this issue Sep 9, 2019
* Put error message from inside the process into the exception that is thrown when the process doesn't start correctly. (#45846)

* update bwcVersions

* [DOCS] Reformat match query (#45152)

* Fix update-by-query script examples (#43907)

Two examples had swapped the order of lang and code when creating a
script.

Relates #43884

* Adjusting ML usage object serialization bwc version (#45874)

* Fsync translog without writeLock before rolling (#45765)

Today, when rolling a new translog generation, we block all write
threads until a new generation is created. This choice is perfectly 
fine except in a highly concurrent environment with the translog 
async setting. We can reduce the blocking time by pre-syncing the
current generation without writeLock before rolling. The new step 
would fsync most of the data of the current generation without 
blocking write threads.

Close #45371

* Add node.processors setting in favor of processors (#45855)

This commit namespaces the existing processors setting under the "node"
namespace. In doing so, we deprecate the existing processors setting in
favor of node.processors.

* Remove binary file accidentally committed

🤦‍♀️

* Fix TransportSnapshotsStatusAction ThreadPool Use (#45824)

In case of an in-progress snapshot this endpoint was broken because
it tried to execute repository operations in the callback on a
transport thread which is not allowed (only generic or snapshot
pool are allowed here).

* Enable testing against JDK 14 (#45178)

This commit enables testing against JDK 14.

* [DOCS] Add anchor to  version types list. (#45886)

* Adding a warning to from-size.asciidoc

Customers occasionally discover a known behavior in Elasticsearch's pagination that does not appear to be documented. This warning is intended to educate customers of this behavior while still highlighting alternative solutions.

* Remove redundant Java check from Sys V init (#45793)

In the Sys V init scripts, we check for Java. This is not needed, since
the same check happens in elasticsearch-env when starting up. Having
this duplicate check has bitten us in the past, where we made a change
to the logic in elasticsearch-env, but missed updating it here. Since
there is no need for this duplicate check, we remove it from the Sys V
init scripts.

* Update joda to 2.10.3 (#45495)

* Allow partial request body reads in AWS S3 retries tests (#45847)

This commit changes the tests added in #45383 so that the fixture that 
emulates the S3 service now sometimes consumes all the request body 
before sending an error, sometimes consumes only a part of the request 
body and sometimes consumes nothing. The idea here is to beef up a bit 
the tests that write blobs because the client's retry logic relies on
marking and resetting the blob's input stream.

This pull request also changes the testWriteBlobWithRetries() so that it 
(rarely) tests with a large blob (up to 1mb), which is more than the client's 
default read limit on input streams (131Kb).

Finally, it optimizes the ZeroInputStream so that it is a bit more effective 
(now works using an internal buffer and System.arraycopy() primitives).

* Move testRetentionLeasesClearedOnRestore (#45896)

* [DOCS] Reformat put mapping API docs (#45709)

* Fix RemoteClusterConnection close race (#45898)

Closing a `RemoteClusterConnection` concurrently with trying to connect
could result in double invoking the listener.

This fixes
RemoteClusterConnectionTest#testCloseWhileConcurrentlyConnecting

Closes #45845

* [ML][Transforms] fix doSaveState check (#45882)

* [ML][Transforms] fix doSaveState check

* removing unnecessary log statement

* [ML] Improve progress reportings for DF analytics (#45856)

Previously, the stats API reported a progress percentage
for DF analytics tasks that are running and are in the
`reindexing` or `analyzing` state.

This means that when the task is `stopped` there is no progress
reported. Thus, one cannot distinguish a task that never
ran from one that completed.

In addition, there are blind spots in the progress reporting.
In particular, we do not account for when data is loaded into the
process. We also do not account for when results are written.

This commit addresses the above issues. It changes progress
to being a list of objects, each one describing the phase
and its progress as a percentage. We currently have 4 phases:
reindexing, loading_data, analyzing, writing_results.

When the task stops, progress is persisted as a document in the
state index. The stats API now reports progress from in-memory
if the task is running, or returns the persisted document
(if there is one).

* Expose the ability to cancel async requests in REST high-level client (#45688)

This commit makes all the async methods in the high level client return the `Cancellable` object that the low level client now exposes.

Relates to #45379 
Closes #44802

* Fix IngestService to respect original document content type (#45799)

This PR modifies the logic in IngestService to preserve the original content type 
on the IndexRequest, such that when a document with a content type like SMILE 
is submitted to a pipeline, the resulting document that is persisted will remain in 
the original content type (SMILE in this case).

* Change `{var}` convention to `<var>` (#45904)

* Fix bugs in Painless SCatch node (#45880)

This fixes two bugs:
- A recently introduced bug where an NPE will be thrown if a catch block is 
empty.
- A long-time bug where an NPE will be thrown if multiple catch blocks in a 
row are empty for the same try block.

* Update translog checkpoint after marking ops as persisted (#45634)

If two translog syncs happen concurrently, then one can return before
its operations are marked as persisted. In general, this should not be
an issue; however, peer recoveries currently rely on this assumption.

Closes #29161

* [DOCS] Reformat get index API docs (#45758)

* [DOCS] Reformat delete index API docs (#45755)

* Handle multiple loopback addresses (#45901)

AbstractSimpleTransportTestCase.testTransportProfilesWithPortAndHost
expects a host to only have a single IPv4 loopback address, which isn't
necessarily the case. Allow for >= 1 address.

* [DOCS] Relocate Ingest API docs to REST API section (#45812)

* [ML][Transforms] adjusting when and what to audit (#45876)

* [ML][Transforms] adjusting when and what to audit

* Update DataFrameTransformTask.java

* removing unnecessary audit message

* Remove processors setting (#45905)

The processors setting was deprecated in version 7.4.0 of Elasticsearch
for removal in Elasticsearch 8.0.0. This commit removes the processors
setting.

* Remove translating processors in Docker entrypoint (#45923)

Now that processors is no longer a valid Elasticsearch setting, this
commit removes translation for it in the Docker entrypoint.

* Deprecate the pidfile setting (#45938)

This commit deprecates the pidfile setting in favor of node.pidfile.

* Adjust node.pidfile version in cluster formation

Now that the deprecation of pidfile has been backported to 7.4.0, this
commit adjusts the version-conditional logic in cluster formation tasks
for setting pidfile versus node.pidfile.

* Remove non task aware execute methods from TransportAction (#45821)

The TransportAction class has several ways to execute the action, some
of which will create a task. This commit removes those non task aware
variants in favor of handling task creation inside NodeClient for local
actions.

* Remove the pidfile setting (#45940)

The pidfile setting was deprecated in version 7.4.0 of Elasticsearch for
removal in Elasticsearch 8.0.0. This commit removes the pidfile setting.

* Allow Transport Actions to indicate authN realm (#45767)

This commit allows the Transport Actions for the SSO realms to
indicate the realm that should be used to authenticate the
constructed AuthenticationToken. This is useful in the case that
many authentication realms of the same type have been configured
and where the caller of the API(Kibana or a custom web app) already
know which realm should be used so there is no need to iterate all
the realms of the same type.
The realm parameter is added in the relevant REST APIs as optional
so as not to introduce any breaking change.

* re-enable BWC tests after merging #45767 (#45948)

* Fix plaintext on TLS port logging (#45852)

Today, if a non-TLS record is received on a TLS port, a generic exception
is logged with the stack trace.
SSLExceptionHelper.isNotSslRecordException method does not work because
it's assuming that NonSslRecordException would be top-level.
This commit addresses the issue and the log would be more concise.

* Add Test Logging for #45953 (#45957)

Adding some logging to track down #45953 and making the failing assertion log more detail

* [DOCS] Reformat create index API docs (#45749)

* Fix SnapshotStatusApisIT (#45929)

The snapshot status when blocking can still be INIT in rare cases when
the new cluster state that has the snapshot in `STARTED` hasn't yet
become visible.
Fixes #45917

* Fix Broken HTTP Request Breaking Channel Closing (#45958)

This is essentially the same issue fixed in #43362 but for http request
version instead of the request method. We have to deal with the
case of not being able to parse the request version, otherwise
channel closing fails.

Fixes #43850

* Refactor RepositoryCredentialsTests (#45919)

This commit refactors the S3 credentials tests in 
RepositoryCredentialsTests so that it now uses a single 
node (ESSingleNodeTestCase) to test how secure/insecure 
credentials are overriding each other. Using a single node 
makes it much easier to understand what each test is actually 
testing and IMO better reflect how things are initialized.

It also allows to fold into this class the test 
testInsecureRepositoryCredentials which was wrongly located 
in S3BlobStoreRepositoryTests. By moving this test away, the 
S3BlobStoreRepositoryTests class does not need the 
allow_insecure_settings option anymore and thus can be 
executed as part of the usual gradle test task.

* [DOCS] Reformat get settings API docs (#45924)

* Better logging for TLS message on non-secure transport channel (#45835)

This commit enhances logging for 2 cases:

1. If non-TLS enabled node receives transport message from TLS enabled
node on transport port.
2. If non-TLS enabled node receives HTTPs request on transport port.

* Relax translog assertion in testRestoreLocalHistoryFromTranslog (#45943)

Since #45473, we trim translog below the local checkpoint of the safe
commit immediately if soft-deletes enabled. In
testRestoreLocalHistoryFromTranslog, we should have a safe commit after
recoverFromTranslog is called; then we will trim translog files which
contain only operations that are at most the global checkpoint.

With this change, we relax the assertion to ensure that we don't put
operations to translog while recovering history from the local translog.

* Consider artifact repositories backed by S3 secure (#45950)

Since credentials are required to access such a repository, and these
repositories are accessed over an encrypted protocol (https), this
commit adds support to consider S3-backed artifact repositories as
secure. Additionally, we add tests for this functionality.

* Build: Support `console-result` language (#45937)

This adds support for verifying that snippets with the `console-result`
language are valid json. It also switches the response snippets on the
`docs/get` page from `js` to `console-result` which will allow clients
to provide "alternatives" for them like they can now do with
`// CONSOLE` snippets.

* [DOCS] Reformat indices exists API docs (#45918)

* [DOCS] Reformat get field mapping API docs (#45700)

* Add Cumulative Cardinality agg (and Data Science plugin) (#43661)

This adds a pipeline aggregation that calculates the cumulative
cardinality of a field.  It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.

This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).
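
The merge-and-emit loop described above can be sketched as follows. This is only an illustration of the idea: it uses an exact `HashSet` as a stand-in for the HLL sketch, and the class and method names are hypothetical rather than taken from the aggregation's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CumulativeCardinalitySketch {

    /** For each bucket, the number of distinct values seen in that bucket and all earlier ones. */
    public static List<Integer> cumulativeCardinality(List<Set<String>> buckets) {
        Set<String> merged = new HashSet<>(); // exact stand-in for the running HLL sketch
        List<Integer> result = new ArrayList<>();
        for (Set<String> bucket : buckets) {
            merged.addAll(bucket);     // merge in the next consecutive bucket
            result.add(merged.size()); // emit the cardinality up to this point
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> visitorsPerDay = List.of(
            Set.of("alice", "bob"),
            Set.of("bob", "carol"),
            Set.of("alice", "carol"));
        System.out.println(cumulativeCardinality(visitorsPerDay)); // prints [2, 3, 3]
    }
}
```

The difference between consecutive values (2, then 1, then 0 in this example) is the number of "new" visitors per bucket.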

This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.

* [DOCS] Correct `IIF` conditional section title (#45979)

* Fix typo in plugin name, add to allowed settings

* PKI realm authentication delegation (#45906)

This commit introduces PKI realm delegation. This feature
supports the PKI authentication feature in Kibana.

In essence, this creates a new API endpoint which Kibana must
call to authenticate clients that use certificates in their TLS
connection to Kibana. The API call passes to Elasticsearch the client's
certificate chain. The response contains an access token to be further
used to authenticate as the client. The client's certificates are validated
by the PKI realms that have been explicitly configured to permit
certificates from the proxy (Kibana). The user calling the delegation
API must have the delegate_pki privilege.

Closes #34396

* [ML] fixing bug where analytics process starts with 0 rows (#45879)

The native process requires that there be a non-zero number of rows to analyze. If the flag --rows 0 is passed to the executable, it throws and does not start.

When building the configuration for the process we should not start the native process if there are no rows.

Adding some logging to indicate what is occurring.

* [ML] add supported types to no fields error message (#45926)

* [ML] add supported types to no fields error message

* adding supported types to logger debug

* Range Field support for Histogram and Date Histogram aggregations(#45395)

 * Add support for a Range field ValuesSource, including decode logic for range doc values and exposing RangeType as a first class enum
 * Provide hooks in ValuesSourceConfig for aggregations to control ValuesSource class selection on missing & script values
 * Branch aggregator creation in Histogram and DateHistogram based on ValuesSource class, to enable specialization based on type.  This is similar to how Terms aggregator works.
 * Prioritize field type when available for selecting the ValuesSource class type to use for an aggregation

* [TEST] wait for search task to be cancelled in SearchRestCancellationIT (#45978)

SearchRestCancellationIT aborts an http request, and then checks that
the corresponding search task has been cancelled on the server-side.
There is no guarantee that the task has already been marked cancelled
after the `cancel` call returns, and there is no easy way to wait for that.

This commit introduces an assertBusy to try and wait for the search task
to be marked cancelled.
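
As an illustration of the retry-until-true pattern, here is a simplified stand-in written in plain Java. It is not the test framework's `assertBusy` implementation; the backoff values and the `demo` scenario are assumptions made for the sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AssertBusySketch {

    /** Retries the check until it stops throwing AssertionError or the timeout elapses. */
    public static void assertBusy(Runnable check, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the condition holds
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // give up and surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // simple capped backoff
        }
    }

    /** Simulates a task that is marked cancelled shortly after the cancel call returns. */
    public static boolean demo() {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cancelled.set(true);
        }).start();
        assertBusy(() -> {
            if (!cancelled.get()) {
                throw new AssertionError("task not cancelled yet");
            }
        }, 10, TimeUnit.SECONDS);
        return cancelled.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```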

Closes #45911

* Remove node settings from blob store repositories (#45991)

This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we move them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.

* [DOCS] Streamline GS search topic. (#45941)

* Streamline GS search topic.

* Added missing comma.

* Update docs/reference/getting-started.asciidoc 

Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>

* Add test for CopyBytesSocketChannel (#45873)

Currently we use a custom CopyBytesSocketChannel for interfacing with
netty. We have integration tests that use this channel, however we never
verify the read and write behavior in the face of potential partial
writes. This commit adds a test for this behavior.

* Do not create engine under IndexShard#mutex (#45263)

Today we create new engines under IndexShard#mutex. This is not ideal
because it can block the cluster state updates which also execute under
the same mutex. We can avoid this problem by creating new engines under
a separate mutex.

Closes #43699

* Fix compilation in CumulativeCardinalityAggregatorTests (#46000)

Some generics were specified at too fine-grained a level.

* [DOCS] Streamlined GS aggs section. (#45951)


* [DOCS] Streamlined GS aggs section.

* Update docs/reference/getting-started.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Don't use assemble task on root project (#45999)

The root project uses the base plugin to get a clean task, but does not
actually need the assemble task. This commit changes the root project to
use the lifecycle-base plugin, which while still creating the assemble
task, won't add any dependencies to it.

* [DOCS] Fix typo. (#46006)

* [TEST] wait for http channels to be closed in ESIntegTestCase (#45977)

We recently added a check to `ESIntegTestCase` in order to verify that
no http channels are being tracked when we close clusters and the
REST client. Close listeners though are invoked asynchronously, hence
this check may fail if we assert before the close listener that removes
the channel from the map is invoked.

With this commit we add an `assertBusy` so we try and wait for the map
to be empty.

Closes #45914
Closes #45955

* Add `manage_own_api_key` cluster privilege (#45897)

The existing privilege model for API keys, with privileges like
`manage_api_key`, `manage_security` etc., is too permissive and
we want finer-grained control over the cluster privileges
for API keys. Previously, an API key would also need these
privileges to get its own information.

This commit adds support for `manage_own_api_key` cluster privilege
which only allows api key cluster actions on API keys owned by the
currently authenticated user. Also adds support for retrieval of
the API key self-information when authenticating via API key
without the need for the additional API key privileges.
To support this privilege, we are introducing additional
authentication context along with the request context such that
it can be used to authorize cluster actions based on the current
user authentication.

The API key get and invalidate APIs introduce an `owner` flag
that can be set to true if the API key request (Get or Invalidate)
is for the API keys owned by the currently authenticated user only.
In that case, `realm` and `username` cannot be set as they are
assumed to be the currently authenticated ones.

The changes cover HLRC changes, documentation for the API changes.

Closes #40031

* Partly revert globalInfo.ready check (#45960)

This check was introduced in #41392 but had the unwanted side-effect
that the keystore settings in such blocks would not be added in the
node's keystore. Given that we have a mid-term plan for FIPS testing
that would make such checks unnecessary, and that the conditional
in these two cases is not really that important, this change removes
this conditional logic so that full-cluster-restart and rolling
upgrade tests will run with PEM files for key/certificate material
no matter if we're in a FIPS JVM or not.

Resolves: #45475

* [ML] Add option to regression to randomize training set (#45969)

Adds a parameter `training_percent` to regression. The default
value is `100`. When the parameter is set to a value less than `100`,
from the rows that can be used for training (ie. those that have a
value for the dependent variable) we randomly choose whether to actually
use for training. This enables splitting the data into a training set and
the rest, usually called testing, validation or holdout set, which allows
for validating the model on data that have not been used for training.

Technically, the analytics process considers as training the data that
have a value for the dependent variable. Thus, when we decide a training
row is not going to be used for training, we simply clear the row's
dependent variable.
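
A minimal sketch of the selection step described above, assuming an independent random draw per row (how the actual implementation samples is not stated here). The class and method names are hypothetical, and `null` stands in for a cleared dependent variable.

```java
import java.util.Random;

public class TrainingPercentSketch {

    /**
     * Returns a copy of the dependent-variable column in which rows not selected
     * for training have their value cleared (set to null).
     */
    public static Double[] selectTraining(Double[] dependentVariable, double trainingPercent, Random random) {
        Double[] out = dependentVariable.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null) {
                continue; // no dependent variable, so this was never a training row
            }
            // nextDouble() is in [0, 1), so a training percent of 100 keeps every row.
            if (random.nextDouble() * 100.0 >= trainingPercent) {
                out[i] = null; // clear the value so the row is not used for training
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] column = {1.5, null, 3.0, 4.2, 0.7};
        Double[] selected = selectTraining(column, 50.0, new Random());
        for (Double value : selected) {
            System.out.println(value == null ? "held out / not trainable" : "training: " + value);
        }
    }
}
```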

* Disallow partial results when shard unavailable (#45739)

Searching with `allowPartialSearchResults=false` could still return
partial search results during recovery. If a shard copy fails
with a "shard not available" exception, the failure would be ignored and
a partial result returned. The one case where this is known to happen
is when a shard copy is recovering when searching, since
`IllegalIndexShardStateException` is considered a "shard not available"
exception.

Relates to #42612

* [DOCS] Reformat open index API docs (#45921)

* Fix RegressionTests#fromXContent (#46029)

* The `trainingPercent` must be between `1` and `100`, not `0` and `100` which is causing test failures

* [DOCS] Separate and reformat close index API docs (#45922)

* Remove already exist assertion while renew ccr lease (#46009)

If a CCR lease disappears while we are renewing it, then we will
issue asyncAddRetentionLease to add that lease. And if
asyncAddRetentionLease takes longer than retentionLeaseRenewInterval,
then we can issue another asyncAddRetentionLease request. One of
asyncAddRetentionLease requests will fail with
RetentionLeaseAlreadyExistsException, hence trip the assertion.

Closes #45192

* Watcher max_iterations with foreach action execution (#45715)

Prior to this commit the foreach action execution had a hard coded 
limit to 100 iterations. This commit allows the max number of 
iterations to be a configuration ('max_iterations') on the foreach 
action. The default remains 100.

* [DOCS] Reformat update index settings API docs (#45931)

* Always add Java-9 style file permissions (#46050)

Java 9 removed pathname canonicalization, which means that we need to
add permissions for the path and also the real path when adding file
permissions. Since master requires a minimum runtime of JDK 11, we no
longer need conditional logic here to apply this pathname
canonicalization with our bare hands. This commit removes that
conditional pathname canonicalization.

* [ML][HLRC] Add data frame analytics regression analysis (#46024)

* [ML] Support boolean fields for DF analytics (#46037)

This commit adds support for `boolean` fields in data frame
analytics (and currently both outlier detection and regression).
The analytics process expects `boolean` fields to be encoded as
integers with 0 or 1 value.

* Add a few notes on Cancellable to the LLRC and HLRC docs. (#45912)

Add a section to both the low level and high level client documentation on asynchronous usage and `Cancellable` added for #44802 

Co-Authored-By: Lee Hinman <dakrone@users.noreply.github.com>

* [DOCS] [8.0] Add upgrade matrix to docs (#46027)

* [DOCS] Add index alias exists API docs (#46042)

* Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068)

* Add XContentType as parameter to HLRC ART#createServerTestInstance (#46036)

Add XContentType as parameter to the
AbstractResponseTestCase#createServerTestInstance method.

In the case a server side response class serializes xcontent as
bytes then the test needs to know what xcontent type was randomly selected.

This change is needed in #45970

* Fix rollover alias in SLM history index template (#46001)

This commit adds the `rollover_alias` setting required for ILM to work
correctly to the SLM history index template and adds assertions to the
SLM integration tests to ensure that it works correctly.

* Handle no-op document level failures (#46083)

Today we assume that document failures can not occur for no-ops. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.

* Remove plugins dir reference from docs (#46047)

While the plugin installation directory used to be settable, it has not
been so for several major versions. This commit removes a lingering
reference to the plugins directory in upgrade docs.

closes #45889

* Fix rest-api-spec dep for external plugins (#45949)

This commit fixes the maven coordinates for the rest-api-spec jar. It
was accidentally broken by #45107.

closes #45891

* Use float instead of double for query vectors. (#46004)

Currently, when using script_score functions like cosineSimilarity, the query
vector is treated as an array of doubles. Since the stored document vectors use
floats, it seems like the least surprising behavior for the query vectors to
also be float arrays.

In addition to improving consistency, this change may help with some
optimizations we have been considering around vector dot product.

* Add Circle Processor (#43851)

add circle-processor that translates circles to polygons

* [ML] Throw an error when a datafeed needs CCS but it is not enabled for the node (#46044)

Though we allow CCS within datafeeds, users could prevent nodes from accessing remote clusters. This can cause mysterious errors that are difficult to troubleshoot.

This commit adds a check to verify that `cluster.remote.connect` is enabled on the current node when a datafeed is configured with a remote index pattern.

* Muting org.elasticsearch.client.MachineLearningIT.testEstimateMemoryUsage (#46099)

* [DOCS] Adds search-related query parameters to the common parameters. (#46057)

@szabosteve Merging so I can make some additions. Will incorporate the comments from @jrodewig.

* Move netty numDirectArenas to jvm.options (#46104)

We currently configure io.netty.allocator.numDirectArenas to be 0 in the
jvm ergonomics class. This is a config that we always want to set, so it
makes sense to move it to jvm.options.

* Handle delete document level failures (#46100)

Today we assume that document failures can not occur for deletes. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.

* [DOCS] Reformats delete by query API (#46051)

* Reformats delete by query API

* Update docs/reference/docs/delete-by-query.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Updated common parms includes.

* Flush engine after big merge (#46066)

Today we might carry on a big merge uncommitted and therefore
occupy a significant amount of diskspace for quite a long time
if for instance indexing load goes down and we are not quickly
reaching the translog size threshold. This change will cause a
flush if we hit a significant merge (512MB by default) which
frees diskspace sooner.

* Docs _cat/health verification fix (#46064)

The _cat/health call in getting-started assumes that the master task max
wait time is always 0 (-), however, the test could sometimes run into a
short wait time (like some ms). Fixed to allow this.

* Do not throw an exception if the process finished quickly but without any error. (#46073)

* [DOCS] Reformats URI search request (#45844)

* [DOCS] Reformats URI search request.

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

Co-Authored-By: debadair <debadair@elastic.co>

* DOC: Update SQL docs for DbVis and Workbench/J (#45981)

Refresh the setup for the new versions of DbVisualizer and SQL
Workbench/J which have Elasticsearch JDBC support out of the box.

* Upgrade to Azure SDK 8.4.0 (#46094)

* Upgrading to 8.4.0 here which brings bulk deletes to be used in a follow up PR

* Use better matchers in AbstractSimpleTransportTestCase (#45899)

Convert most of the assertions to use Hamcrest matchers, as they give much
more context if an assertion fails.

* Refactor auditor-related classes (#45893)

* Unmute the test now that the fix for the underlying cause is merged in. (#46117)

* Replace MockAmazonS3 usage in S3BlobStoreRepositoryTests by a HTTP server (#46081)

This commit removes the usage of MockAmazonS3 in S3BlobStoreRepositoryTests 
and replaces it by a HttpServer that emulates the S3 service. This allows the 
repository tests to use the real Amazon's S3 client under the hood in tests and will 
allow to test the behavior of the snapshot/restore feature for S3 repositories by 
simulating random server-side internal errors.

The HTTP server used to emulate the S3 service is intentionally simple and minimal 
to keep things understandable and maintainable. Testing full client options on the 
server side (like authentication, chunked encoding etc) remains the responsibility 
of the AmazonS3Fixture.

* Avoid overshooting watermarks during relocation (#46079)

Today the `DiskThresholdDecider` attempts to account for already-relocating
shards when deciding how to allocate or relocate a shard. Its goal is to stop
relocating shards onto a node before that node exceeds the low watermark, and
to stop relocating shards away from a node as soon as the node drops below the
high watermark.

The decider handles multiple data paths by only accounting for relocating
shards that affect the appropriate data path. However, this mechanism does not
correctly account for _new_ relocating shards, which are unwittingly ignored.
This means that we may evict far too many shards from a node above the high
watermark, and may relocate far too many shards onto a node causing it to blow
right past the low watermark and potentially other watermarks too.

There are in fact two distinct issues that this PR fixes. New incoming shards
have an unknown data path until the `ClusterInfoService` refreshes its
statistics. New outgoing shards have a known data path, but we fail to account
for the change of the corresponding `ShardRouting` from `STARTED` to
`RELOCATING`, meaning that we fail to find the correct data path and treat the
path as unknown here too.

This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths
for all shards. With the changes here, the data paths are handled in tests as
they are in production, except that their sizes are fake.

Fixes #45177

* AwaitsFix for #46124

* Revert "Use better matchers in AbstractSimpleTransportTestCase (#45899)"

This reverts commit 38cf581d360bdf50b1b1f1b21607887d8c91cf36.

* Revert "AwaitsFix for #46124"

This reverts commit 71ead7552df1fbdfab2c0e72015496f53b29ab20.

* [DOCS] [PUT DFA] Documents inline the child params of source and dest (#45649)

* [DOCS] [PUT DFA] Documents inline the child params of source and dest.

* [DOCS] Fixes indentation issues and amends dfa definitions.

* Only verify global checkpoint if translog sync occurred (#45980)

We only sync translog if the given offset hasn't synced yet. We can't
verify the global checkpoint from the latest translog checkpoint unless
a sync has occurred.

Closes #46065
Relates #45634

* Start testing against AdoptOpenJDK (#45666)

This commit adds AdoptOpenJDK to the testing matrix.

* [DOCS] Reformats analyze API (#45986)

* [DOCS] Add get index alias API docs (#46046)

* Validate SLM policy ids strictly (#45998)

This uses strict validation for SLM policy ids, similar to what we use
for index names.

Resolves #45997

* More Efficient Ordering of Shard Upload Execution (#42791)

* Change the upload order of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657

* [DOCS] Correct custom analyzer callouts (#46030)

* Rename `data-science` plugin to `analytics` (#46092)

This renames the "data-science" plugin to "analytics".
Also removes the enabled flag

* [DOCS] Separate add index alias API docs (#46086)

* [DOCS] Reformat update index aliases API docs (#46093)

* [ML] Regression dependent variable must be numeric (#46072)

* [ML] Regression dependent variable must be numeric

This adds a validation that the dependent variable of a regression
analysis must be numeric.

* Address review comments and fix some problems

In addition to addressing the review comments, this
commit fixes a few issues I found during testing.

In particular:

- if there were mappings for required fields but they were
not included we were not reporting the error
- if explicitly included fields had unsupported types we were
not reporting the error

Unfortunately, I couldn't get those fixed without refactoring
the code in `ExtractedFieldsDetector`.

* Ensure top docs optimization is fully disabled for queries with unbounded max scores. (#46105)

When a query contains a mandatory clause that doesn't track the max score per
block, we disable the max score optimization. Previously, we were doing this by
wrapping the collector with a FilterCollector that always returned
ScoreMode.COMPLETE.

However we weren't adjusting totalHitsThreshold, so the collector could still
call Scorer#setMinCompetitiveScore. It is against the method contract to call
setMinCompetitiveScore when the score mode is COMPLETE, and some scorers like
ReqOptSumScorer throw an error in this case.

This commit tries to disable the optimization by always setting
totalHitsThreshold to max int, as opposed to wrapping the collector.

* [DOCS] Add "index template exists" API docs (#46095)

* [DOCS] Add "delete index template" API docs (#46101)

* Remove classic similarity (#46078)

This commit removes the `classic` similarity from code and docs in master (8.0). The `classic` similarity cannot be used on indices created after 7.0.

Closes #46058

* Add package docs for bundled jdk location (#46153)

This commit expands the documented directory layout of the rpm and deb
packages to include the bundled jdk.

closes #45150

* bump version (#46158)

* Set netty system properties in BuildPlugin (#45881)

Currently in production instances of Elasticsearch we set a couple of
system properties by default. We currently do not apply all of these
system properties in tests. This commit applies these properties in the
tests.

* Remove insecure settings (#46147)

This commit removes the oxymoron of insecure secure settings from the
code base. In particular, we remove the ability to set the access_key
and secret_key for S3 repositories inside the repository definition (in
the cluster state). Instead, these settings now must be in the
keystore. Thus, it also removes some leniency where these settings could
be placed in the elasticsearch.yml, would not be rejected there, but
would not be consumed for any purpose.

* Inject random errors in S3BlobStoreRepositoryTests (#46125)

This commit modifies the HTTP server used in S3BlobStoreRepositoryTests
so that it randomly returns server errors for any type of request executed by
the SDK client. It is now possible to verify that the repository tests
complete successfully even if one or more errors were returned by the S3
service in response to a blob upload, a blob deletion, an object listing request,
etc.

Because injecting errors forces the SDK client to retry requests, the test caps
the errors sent in response to each request at 3 retries.

* Forbid settings without a namespace (#45947)

This commit forbids settings that are not in any namespace, all setting
names must now contain a dot.
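
A minimal sketch of such a check, assuming a hypothetical `validateKey` helper (the real validation lives in the settings infrastructure and may differ in wording and placement):

```java
final class SettingNamespaceCheck {
    // Hypothetical validator: rejects any setting key that is not in a namespace,
    // i.e. any key that does not contain a dot.
    static void validateKey(String key) {
        if (key == null || key.contains(".") == false) {
            throw new IllegalArgumentException("setting [" + key + "] must be in a namespace (contain a '.')");
        }
    }

    // Convenience wrapper that reports the outcome as a boolean.
    static boolean isValid(String key) {
        try {
            validateKey(key);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("cluster.name")); // namespaced key
        System.out.println(isValid("foo"));          // no namespace
    }
}
```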

* Enhanced logging when transport is misconfigured to talk to HTTP port (#45964)

If a node is misconfigured to talk to a remote node's HTTP port (instead of
the transport port), it will eventually receive an HTTP response from the
remote node on the transport port (this happens when a node accidentally
sends a line-terminating byte in a transport request).
Today this results in an unfriendly log message and a long stack trace.
This commit adds a check for whether a malformed response is an HTTP
response. In this case, a concise log message is emitted instead.
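
One way to picture the check (a sketch under assumptions, not the transport code itself): an HTTP response always starts with the ASCII bytes `HTTP`, so the first bytes of an otherwise malformed message can be compared against that prefix. The class, method names and log text below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class HttpOnTransportPortCheck {
    private static final byte[] HTTP_PREFIX = "HTTP".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the received bytes start with "HTTP", as an HTTP status line does.
    static boolean looksLikeHttpResponse(byte[] received) {
        if (received.length < HTTP_PREFIX.length) {
            return false;
        }
        for (int i = 0; i < HTTP_PREFIX.length; i++) {
            if (received[i] != HTTP_PREFIX[i]) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical message selection: concise for the HTTP case, generic otherwise.
    static String describe(byte[] received) {
        return looksLikeHttpResponse(received)
            ? "received HTTP response on transport port, check that the remote port is a transport port"
            : "received malformed transport message";
    }

    public static void main(String[] args) {
        byte[] response = "HTTP/1.1 400 Bad Request".getBytes(StandardCharsets.US_ASCII);
        System.out.println(describe(response));
    }
}
```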

* Fix wrong URL encoding in watcher HTTP client (#45894)

The test assumption was calling the wrong method resulting in a URL
encoding before returning the data.

Closes #44970

* Fix translog stats in testPrepareIndexForPeerRecovery (#46137)

When recovering a shard locally, we use a translog snapshot from
newSnapshotFromGen which consists of all readers from a certain
generation. In the test, we use newSnapshotFromMinSeqNo for the
expectation. The snapshot of this method includes only readers
containing operations in the requesting range.

Closes #46022

* Make Snapshot Logic Write Metadata after Segments (#45689)

* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
   * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes #41581

* [TEST] Mute PinnedQueryBuilderIT.testPinnedPromotions (#46175)

Relates #46174

* Move plugin.mandatory to installing plugins docs

This commit moves the plugin.mandatory settings from the plugin
directory page in the docs to the installing plugins page in the docs.

* Move plugin.mandatory to its own page

This commit takes the reworking of plugin.mandatory docs even farther by
taking this setting to its own page.

* Add test tasks for unpooled and direct buffer pooling to netty (#46049)

Some netty behavior is controlled by system properties. While we want to
test with the defaults for Elasticsearch for most tests, within netty we
want to ensure these netty settings exhibit correct behavior. This
commit adds variants of test and integTest tasks for netty which set the
unpooled and direct buffer pooled allocators.

relates #45881

* Stabilize SLM REST Tests (#46195)

Unfortunately, #42791 destabilized the SLM tests because those tests
rate-limit the snapshot write rate to a very low value globally.
Now that the various files in a snapshot get uploaded in parallel,
a few threads can together overshoot the low throughput value used
by the rate limiter by a wide margin, which makes it wait for
minutes. This times out the tests that then try to abort the
snapshot (see #21759 for details; aborting a snapshot only
happens when writing bytes to the repository).

For now the old behavior of the test from before my changes can
be restored by moving to a single threaded snapshot pool but
we should find a better way of testing the SLM behaviour here in
a follow-up.

* Clarify default behavior of auto_create_index (#46134)

Be specific about the default behaviour of `action.auto_create_index` when a list is given.

* Mute SnapshotLifeCycleIT (#46207)

Relates #46205

* Remove Unused Method from BlobStoreRepository (#46204)

This method isn't used anymore and I forgot to delete it.

* Allow ingest processors access to node client. (#46077)

This is the first PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.

* SQL: Fix issue with DataType for CASE with NULL (#46173)

Previously, if the DataType of all the WHEN conditions of a CASE
statement is NULL, then it was set to NULL even if the ELSE clause
has a non-NULL data type, e.g.:
```
CASE WHEN a = 1 THEN NULL
     WHEN a = 5 THEN NULL
     ELSE 'foo'
END
```
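
The intended resolution rule can be sketched roughly as follows. This is an illustration only: the `DataType` enum and `resolve` method are hypothetical stand-ins, not the SQL plugin's classes.

```java
import java.util.Arrays;
import java.util.List;

final class CaseTypeResolution {
    // Hypothetical, heavily reduced set of data types for the example.
    enum DataType { NULL, KEYWORD, INTEGER }

    // Picks the first non-NULL type among the WHEN results and
    // falls back to the ELSE type if every WHEN result is NULL.
    static DataType resolve(List<DataType> whenResults, DataType elseResult) {
        for (DataType type : whenResults) {
            if (type != DataType.NULL) {
                return type;
            }
        }
        return elseResult;
    }

    public static void main(String[] args) {
        // Mirrors the CASE above: both WHEN branches yield NULL, ELSE yields a string.
        System.out.println(resolve(Arrays.asList(DataType.NULL, DataType.NULL), DataType.KEYWORD));
    }
}
```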

Fixes: #46032

* Mute 2 tests in S3BlobStoreRepositoryTests (#46221)

Muted testSnapshotAndRestore and testMultipleSnapshotAndRollback

Relates #46218 and #46219

* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)

Aborts and failures were handled in a somewhat unfortunate way in #42791:
Since the tasks for all files are generated before uploading they are all executed when a snapshot is aborted and lead to a massive number of failures added to the original aborted exception.
In the case of failures the situation was not very reasonable as well. If one blob fails uploading the snapshot logic would upload all the remaining files as well and then fail (when previously it would just fail all following files).
I fixed both of the above issues, by just short-circuiting all remaining tasks for a shard in case of an exception in any one upload.
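
The short-circuiting idea can be sketched like this. It is a simplified, sequential illustration with hypothetical names; the real uploads run on the snapshot thread pool and are more involved.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class ShortCircuitUploads {
    // Hypothetical stand-in for a single file upload task.
    interface Upload {
        void run() throws Exception;
    }

    // Runs the pre-generated tasks in order. Once one task has failed, every
    // remaining task is skipped instead of being executed. Returns how many
    // tasks were actually executed; the first failure is kept in firstFailure.
    static int runAll(List<Upload> uploads, AtomicReference<Exception> firstFailure) {
        int executed = 0;
        for (Upload upload : uploads) {
            if (firstFailure.get() != null) {
                continue; // short-circuit: an earlier upload for this shard already failed
            }
            executed++;
            try {
                upload.run();
            } catch (Exception e) {
                firstFailure.compareAndSet(null, e);
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        Upload ok = () -> {};
        Upload failing = () -> { throw new IOException("blob upload failed"); };
        AtomicReference<Exception> failure = new AtomicReference<>();
        int executed = runAll(Arrays.asList(ok, failing, ok, ok), failure);
        System.out.println(executed + " executed, failure: " + failure.get().getMessage());
    }
}
```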

* Test fix for PinnedQueryBuilderIT (#46187)

Fix test issue to stabilise scoring through use of DFS search mode.
Randomised index-then-delete docs introduced by the test framework likely caused an imbalance in IDF scores across shards. Also made number of shards used in test a random number for added test coverage.

Closes #46174

* Wait for all Rec. to Stop on Node Close (#46178)

* Wait for all Rec. to Stop on Node Close

* This issue is in the `RecoverySourceHandler#acquireStore`. If we submit the store release to the generic threadpool while it is getting shut down, we never complete the future we wait on (in the generic pool as well) and potentially never release the store.
* Fixed by waiting for all recoveries to end on node close so that we always have a healthy thread pool here
* Closes #45956

*  Disable request throttling in S3BlobStoreRepositoryTests (#46226)

When some high values are randomly picked up - for example the number 
of indices to snapshot or the number of snapshots to create - the tests in S3BlobStoreRepositoryTests can generate a high number of requests to 
the internal S3 server.

In order to test the retry logic of the S3 client, the internal server is 
designed to randomly generate server errors. When many
requests are made, it is possible that the S3 client reaches its maximum 
number of successive retries capacity. Then the S3 client will stop 
retrying requests until enough retry attempts succeed, but it means 
that any request could fail before reaching the max retries count and 
make the test fail too.

Closes #46217
Closes #46218
Closes #46219

* Sync translog without lock when trim unreferenced readers (#46203)

With this change, we can avoid blocking writing threads when trimming
unreferenced readers; hence improving the translog writing performance
in async durability mode.

Close #46201

* Add debug assertions for userhome not existing (#46206)

The elasticsearch user should not have a homedir, yet we have seen this
particular test fail rather frequently with a failed check that the
userhome does not exist. This commit adds some additional assertions on
the presumptive userhome to narrow down where it might be created.

relates #45903

* Remove duplicate line in SearchAfterBuilder (#45994)

* reset queryGeometry in ShapeQueryTests (#45974)

* [ML-DataFrame] Fix off-by-one error in checkpoint operations_behind (#46235)

Fixes a problem where operations_behind would be one less than
expected per shard in a new index matched by the data frame
transform source pattern.

For example, if a data frame transform had a source of foo*
and a new index foo-new was created with 2 shards and 7 documents
indexed in it then operations_behind would be 5 prior to this
change.

The problem was that an empty index has a global checkpoint
number of -1 and the sequence number of the first document that
is indexed into an index is 0, not 1.  This doesn't matter for
indices included in both the last and next checkpoints, as the
off-by-one errors cancelled, but for a new index it affected
the observed result.
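
The arithmetic from the example above can be worked through in a small sketch. The per-shard split of the 7 documents (4 and 3) is an assumption made for illustration, and the class and method names are hypothetical.

```java
final class OperationsBehindExample {
    // Global checkpoint of an empty shard, as described above.
    static final long NO_OPS_PERFORMED = -1L;

    // Sums, over all shards, the distance between the current global checkpoint
    // and the global checkpoint recorded at the last transform checkpoint.
    static long operationsBehind(long[] lastCheckpoints, long[] currentCheckpoints) {
        long behind = 0;
        for (int shard = 0; shard < currentCheckpoints.length; shard++) {
            behind += currentCheckpoints[shard] - lastCheckpoints[shard];
        }
        return behind;
    }

    public static void main(String[] args) {
        // foo-new: 2 shards holding 4 and 3 documents, so sequence numbers 0..3 and 0..2.
        long[] current = { 3, 2 };
        long[] emptyIndexBaseline = { NO_OPS_PERFORMED, NO_OPS_PERFORMED };
        long[] offByOneBaseline = { 0, 0 };

        System.out.println(operationsBehind(emptyIndexBaseline, current)); // 7, one per document
        System.out.println(operationsBehind(offByOneBaseline, current));   // 5, one short per shard
    }
}
```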

* Fixed synchronizing REST API inflight breaker names with internal variable (#40878)

The internal configuration setting is named network.breaker.inflight_requests,
but the REST API exposed the name with an extra underscore: network.breaker.in_flight_requests
The REST API name now matches the internal one, without the extra underscore: network.breaker.inflight_requests

* [DOCS] Add delete index alias API docs (#46080)

* [ML][Transforms] fixing stop on changes check bug (#46162)

* [ML][Transforms] fixing stop on changes check bug

* Adding new method finishAndCheckState to cover race conditions in early terminations

* changing stopping conditions in `onStart`

* allow indexer to finish when exiting early

* Fix testSyncFailsIfOperationIsInFlight (#46269)

testSyncFailsIfOperationIsInFlight could fail due to the index request
spawning a GCP sync (new since 7.4). Test now waits for it to finish
before testing that flushed sync fails.

* [ML] Unmute testStopOutlierDetectionWithEnoughDocumentsToScroll (#46271)

The test seems to have been failing due to a race condition between
stopping the task and refreshing the destination index. In particular,
we were going forward with refreshing the destination index even
though the task stopped in the meantime. This was fixed in
request.

Closes #43960

* [ML][Transforms] protecting doSaveState with optimistic concurrency (#46156)

* [ML][Transforms] protecting doSaveState with optimistic concurrency

* task code cleanup

* Suppress warning from background sync on relocated primary (#46247)

If a primary is being relocated, then the global checkpoint and
retention lease background sync can emit unnecessary warning logs.
This side effect was introduced in #42241.

Relates #40800
Relates #42241

* Add CumulativeCard pipeline agg to pipeline index (#46279)

The Cumulative Cardinality docs weren't linked
from the pipeline index page

* Add more assertions and cleanup to setup passwords tests (#46289)

This commit is a followup to #46206 to continue debugging failures in an
elasticsearch homedir being created. A couple more assertions are added
as well as a final cleanup at the end of the previous test to the one
that fails.

* Multi-get requests should wait for search active (#46283)

When a shard has fallen search idle, and a non-realtime multi-get
request is executed, today such requests do not wait for the shard to
become search active and therefore such requests do not wait for a
refresh to see the latest changes to the index. This also prevents such
requests from triggering the shard as non-search idle, influencing the
behavior of scheduled refreshes. This commit addresses this by attaching
a listener to the shard search active state for multi-get requests. In
this way, when the next scheduled refresh is executed, the multi-get
request will then proceed.

* [ML][Transforms] fixing listener being called twice (#46284)

* Mute testRecoveryFromFailureOnTrimming

Tracked at #46267

* Move MockRepository into test framework (#46298)

This moves the `MockRepository` class into `test/framework/src/main` so
it can be used across all modules and plugins in tests.

* First round of optimizations for vector functions. (#46294)

This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors while computing the vector function. (#46103)
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)
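
To make the precomputation idea concrete, here is a minimal sketch with hypothetical names (it is not the code from the feature branch). The document norm is computed once up front and the query is normalized once per query, so the per-document work reduces to a dot product and one division.

```java
final class PrecomputedCosineSimilarity {
    // L2 norm of a vector; in the scheme above this would be computed once, at indexing time.
    static float l2Norm(float[] vector) {
        double sumOfSquares = 0;
        for (float x : vector) {
            sumOfSquares += (double) x * x;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    // Normalizes the query vector once per query rather than once per document.
    // A zero vector is not handled here and would produce NaN values.
    static float[] normalize(float[] query) {
        float norm = l2Norm(query);
        float[] normalized = new float[query.length];
        for (int i = 0; i < query.length; i++) {
            normalized[i] = query[i] / norm;
        }
        return normalized;
    }

    // Cosine similarity from a pre-normalized query and a precomputed document norm.
    static float cosine(float[] normalizedQuery, float[] doc, float docNorm) {
        float dot = 0f;
        for (int i = 0; i < doc.length; i++) {
            dot += normalizedQuery[i] * doc[i];
        }
        return dot / docNorm;
    }

    public static void main(String[] args) {
        float[] doc = { 3f, 4f };
        float docNorm = l2Norm(doc);
        float[] query = normalize(new float[] { 3f, 4f });
        System.out.println(cosine(query, doc, docNorm));
    }
}
```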

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>

* Initialize document subset bit set cache used for DLS (#46211)

This commit initializes DocumentSubsetBitsetCache even if DLS
is disabled. Previously it would throw null pointer when querying
usage stats if we explicitly disabled DLS as there would be no instance of DocumentSubsetBitsetCache to query. It is okay to initialize
DocumentSubsetBitsetCache which will be empty as the license enforcement
would prevent usage of DLS feature and it will not fail when accessing usage stats.

Closes #45147

* [ML-DataFrame] unmute tests for debugging purposes (#46121)

unmute testGetCheckpointStats

closes #45238

* SQL: Fix issue with IIF function when condition folds (#46290)

Previously, when the condition (1st argument) of the IIF function could
be evaluated (folded) to false, the `IfConditional` was eliminated which
caused `IndexOutOfBoundsException` to be thrown when `info()` and
`resolveType()` methods were called.

Fixes: #46268

* [DOCS] Reformats multi search API (#46256)

* [DOCS] Reformats multi search API.

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Remove stack trace logging in Security(Transport|Http)ExceptionHandler (#45966)

As per #45852 comment we no longer need to log stack-traces in
SecurityTransportExceptionHandler and SecurityHttpExceptionHandler even
if trace logging is enabled.

* [DOCS] Reformats request body search API (#46254)

* [DOCS] Reformats request body search API.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303)

Reenable this test since it was fixed by #45689 in production
code (specifically, the fact that we write the `snap-` blobs
without overwrite checks now).
Only required adding the assumed blocking on index file writes
to test code to properly work again.

* Closes #25281

* DOCS Link to kib reference from es reference on PKI authn (#46260)

* Quote the task name in reproduction line printer (#46266)

Some tasks have `#` in their name, for instance, which doesn't play well with some shells
(e.g. zsh).
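
A sketch of the quoting, assuming a hypothetical task path and helper names (the real reproduction line printer builds a longer command):

```java
final class ReproductionLineQuoting {
    // Wraps the task path in double quotes so the shell passes it through verbatim.
    // Task paths that themselves contain a double quote are not handled in this sketch.
    static String quoteTask(String taskPath) {
        return "\"" + taskPath + "\"";
    }

    // Hypothetical, shortened reproduction line.
    static String reproduceWith(String taskPath, String testClass) {
        return "./gradlew " + quoteTask(taskPath) + " --tests \"" + testClass + "\"";
    }

    public static void main(String[] args) {
        // The task path below is made up for illustration.
        System.out.println(reproduceWith(":qa:example:v7.3.0#exampleTest", "org.example.SomeIT"));
    }
}
```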

* [DOCS] Reformats search shards API (#46240)

* [DOCS] Reformats search shards API
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Fix SearchService.createContext exception handling (#46258)

An exception from the DefaultSearchContext constructor could leak a
searcher, causing future issues like shard lock obtained exceptions. The
underlying cause of the exception in the constructor has been fixed, but
as a safety precaution we also fix the exception handling in
createContext.

Closes #45378

* Bwc testclusters all (#46265)

Convert all bwc projects to testclusters

* Adjacency_matrix aggregation optimisation. (#46257)

Avoid pre-allocating ((N * N) - N) / 2 “BitsIntersector” objects given N filters.
Most adjacency matrices will be sparse and we typically don’t need to allocate all of these objects - can save a lot of allocations when the number of filters is high.
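
For a sense of scale, and as a sketch of the lazy approach (hypothetical names, with a plain `int[]` standing in for the intersector object), consider:

```java
import java.util.HashMap;
import java.util.Map;

final class LazyPairAllocation {
    // Number of distinct filter pairs for N filters: ((N * N) - N) / 2.
    static long pairCount(int n) {
        return ((long) n * n - n) / 2;
    }

    private final int filterCount;
    // Pair objects are created on first use only, keyed by the ordered pair (lo, hi).
    private final Map<Long, int[]> pairs = new HashMap<>();

    LazyPairAllocation(int filterCount) {
        this.filterCount = filterCount;
    }

    // Returns the object for the pair (i, j), allocating it only if it was never requested before.
    int[] pair(int i, int j) {
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        long key = (long) lo * filterCount + hi;
        return pairs.computeIfAbsent(key, k -> new int[] { lo, hi });
    }

    int allocated() {
        return pairs.size();
    }

    public static void main(String[] args) {
        LazyPairAllocation lazy = new LazyPairAllocation(100);
        lazy.pair(0, 1);
        lazy.pair(2, 3);
        System.out.println(lazy.allocated() + " allocated instead of " + pairCount(100));
    }
}
```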

Closes #46212

* [DOCS] Reformats search template and multi search template APIs (#46236)

* [DOCS] Reformats search template and multi search template APIs.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Improve documentation for X-Opaque-ID (#46167)

This field can be present in search slow logs and deprecation logs. The
docs describe how to enable this functionality and what to expect in logs.
closes #44851

* [DOCS] Add "get index template" API docs (#46296)

* Do not send recovery requests with CancellableThreads (#46287)

Previously, we sent recovery requests using CancellableThreads because
we send requests and wait for responses in a blocking manner. With async
recovery, we no longer need to do so. Moreover, if we fail to submit a
request, then we can release the Store using an interruptible thread
which can risk invalidating the node lock.

This PR is the first step to avoid forking when releasing the Store.

Relates #45409
Relates #46178

* Build: Enable testing without magic comments (#46180)

Previously we only turned on tests if we saw either `// CONSOLE` or
`// TEST`. These magic comments are difficult for the docs build to deal
with so it has moved away from using them where possible. We should
catch up. This adds another trigger to enable testing: marking a snippet
with the `console` language. It looks like this:

```
[source,console]
----
GET /
----
```

This saves a line which is nice, I guess. But it is more important to me
that this is consistent with the way the docs build works now.

Similarly this enables response testing when you mark a snippet with the
language `console-result`. That looks like:
```
[source,console-result]
----
{
  "result": "0.1"
}
----
```

`// TESTRESPONSE` is still available for situations like `// TEST`: when
the response isn't *in* the console-result language (like `_cat`) or
when you want to perform substitutions on the generated test.

Should unblock #46159.

* Docs for translog, history retention and flushing (#46245)

This commit updates the docs about translog retention and flushing to reflect
recent changes in how peer recoveries work. It also adds some docs to describe
how history is retained for replay using soft deletes and shard history
retention leases.

Relates #45473

* [DOCS] Reformat "put index template" API docs (#46297)

* Add test that get triggers shard search active (#46317)

This commit is a follow-up to a change that fixed that multi-get was not
triggering a shard to become search active. In that change, we added a
test that multi-get properly triggers a shard to become search
active. This commit is a follow-up to that change which adds a test for
the get case. While get is already handled correctly in production code,
there was not a test for it. This commit adds one. Additionally, we
factor all the search idle tests from IndexShardIT into a separate test
class, as an effort to keep related tests together instead of a single
large test class containing a jumble of tests, and also to keep test
classes smaller for better parallelization.

* Document support of OIDC Implicit flow in Kibana. (#45693)

* [DOCS] Replace "// CONSOLE" comments with [source,console] (#46159)

* [DOCS] Identify reloadable EC2 Discovery Plugin settings (#46102)

* [ML] testFullClusterRestart waiting for stable cluster (#46280)

* [ML] waiting for ml indices before waiting task assignment testFullClusterRestart

* waiting for a stable cluster after fullrestart

* removing unused imports

* [ML][Transforms] fixing rolling upgrade continuous transform test (#45823)

* [ML][Transforms] fixing rolling upgrade continuous transform test

* adjusting wait assert logic

* adjusting wait conditions

* muting test (#46343)

* Decouple shard allocation awareness from search and get requests (#45735)

With this commit, Elasticsearch will no longer prefer using shards in the same location
(with the same awareness attribute values) to process `_search` and `_get` requests.
Instead, adaptive replica selection (the default since 7.0) should route requests more efficiently
using the service time of prior inter-node communications. Clusters with big latencies between
nodes should switch to cross cluster replication to isolate nodes within the same zone.
Note that this change only targets 8.0 since it is considered as breaking. However a follow up
pr should add an option to activate this behavior in 7.x in order to allow users to opt-in early.

Closes #43453

* Revert "Sync translog without lock when trim unreferenced readers (#46203)"

Unfortunately, with this change, we won't clean up all unreferenced
generations when reopening. We assume that there's at most one
unreferenced generation when reopening translog. The previous
implementation guarantees this assumption by syncing translog every time
after we remove a translog reader. This change, however, only syncs
translog once after we have removed all unreferenced readers (can be
more than one) and breaks the assumption.

Closes #46267

This reverts commit fd8183ee51d7cf08d9def58a2ae027714beb60de.

* [DOCS] Identify reloadable S3 repository plugin settings (#46349)

* Unmute testRecoveryFromFailureOnTrimming

Tracked at #46267

* [DOCS] Identify reloadable GCS repository plugin settings (#46352)

* [DOCS] Synchs Watcher API titles with better HLRC titles (#46328)

* Add repository integration tests for Azure (#46263)

Similarly to what had been done for S3 (#46081) and GCS (#46255) 
this commit adds repository integration tests for Azure, based on an 
internal HTTP server instead of mocks.

* Replace mocked client in GCSBlobStoreRepositoryTests by HTTP server (#46255)

This commit removes the usage of MockGoogleCloudStoragePlugin in 
GoogleCloudStorageBlobStoreRepositoryTests and replaces it by a 
HttpServer that emulates the Storage service. This allows the repository 
tests to use the real Google's client under the hood in tests and will allow 
us to test the behavior of the snapshot/restore feature for GCS repositories 
by simulating random server-side internal errors.

The HTTP server used to emulate the Storage service is intentionally simple 
and minimal to keep things understandable and maintainable. Testing full 
client options on the server side (like authentication, chunked encoding 
etc) remains the responsibility of the GoogleCloudStorageFixture.

* Mute failing SamlAuthenticationIT tests (#46369)

see #44410

* Enable Debug Logging for Master and Coordination Packages (#46363)

In order to track down #46091:
* Enables debug logging in REST tests for `master` and `coordination` packages
since we suspect that issues are caused by failed and then retried publications

* Quiet down shard lock failures (#46368)

These were actually never intended to be logged at the warning level but made visible by a refactoring in #19991, which introduced a new exception type but forgot to adapt some of the consumers of the exception.

* [ML][Transforms] allow executor to call start on started task (#46347)

* [DOCS] Reformat index segments API docs (#46345)

* [DOCS] Re-add versioning to put template docs (#46384)

Adds documentation for index template versioning
accidentally removed with #46297.

* [ML][Transforms] update supported aggs docs (#46388)

* Support geotile_grid aggregation in composite agg sources (#45810)

Adds support for `geotile_grid` as a source in composite aggs. 

Part of this change includes adding a new docFormat of `GEOTILE` that formats a hashed `long` value into a geotile formatting string `zoom/x/y`.
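
As a rough illustration of what formatting a hashed `long` into `zoom/x/y` involves, here is a minimal Python sketch. The bit layout (zoom in the high bits, 29 bits each for x and y) is an assumption made for this example, and `encode_tile` and `format_tile` are hypothetical names rather than Elasticsearch APIs; the actual encoding lives in the Elasticsearch source and may differ.

```python
# Assumed layout for illustration only: | zoom | x (29 bits) | y (29 bits) |
MAX_ZOOM = 29                      # assumed number of bits per axis
ZOOM_SHIFT = 2 * MAX_ZOOM          # zoom sits above the x and y bits
AXIS_MASK = (1 << MAX_ZOOM) - 1    # mask that extracts one axis value


def encode_tile(zoom: int, x: int, y: int) -> int:
    """Pack zoom/x/y into a single integer (hypothetical helper)."""
    return (zoom << ZOOM_SHIFT) | (x << MAX_ZOOM) | y


def format_tile(hash_value: int) -> str:
    """Unpack the integer and render it as a 'zoom/x/y' string."""
    zoom = hash_value >> ZOOM_SHIFT
    x = (hash_value >> MAX_ZOOM) & AXIS_MASK
    y = hash_value & AXIS_MASK
    return f"{zoom}/{x}/{y}"


if __name__ == "__main__":
    print(format_tile(encode_tile(7, 64, 51)))
```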

* Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (#46288)

This makes the AllocatedPersistentTask#init() method protected so that
implementing classes can perform their initialization logic there,
instead of the constructor.  Rollup's task is adjusted to use this
init method.

It also slightly refactors the methods to use a static logger in the
AllocatedTask instead of passing it in via an argument. This is
simpler, logged messages come from the task instead of the
service, and it is easier to test.

* [DOCS] Update snippets in security APIs (#46191)

* [DOCS] Identify reloadable Azure repository plugin settings (#46358)

* [DOCS] Reformats Watcher APIs using template (#46152)

* Add docs on upgrading the keystore (#46331)

This commit adds a note to the docs regarding upgrading the keystore.

* [ML] Fixing instance serialization version for bwc (#46403)

* [DOCS] Reformat index stats API docs (#46322)

* Adjusting bwc serialization after backport (#46400)

* Clarify error message on keystore write permissions (#46321)

When the Elasticsearch process does not have write permissions to
upgrade the Elasticsearch keystore, we bail with an error message that
indicates there is a filesystem permissions problem. This commit
clarifies that error message by pointing out the directory where write
permissions are required, or that the user can also run the
elasticsearch-keystore upgrade command manually before starting the
Elasticsearch process. In this case, the upgrade would not be needed at
runtime, so the permissions would not be needed then.

* Revert "Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (#46288)"

This reverts commit d999942c6dfd931266d01db24d3fb26b29cf8f64.

* reuse mock client to avoid problems with thread context closed errors (#46398)

* [DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result]" (#46295)

* [ML-DataFrame] improve error message for timeout case in stop (#46131)

improve error message if stopping of transform times out.

related #45610

* Fix usage of randomIntBetween() in testWriteBlobWithRetries (#46380)

This commit fixes the usage of randomIntBetween() in the test 
testWriteBlobWithRetries, when the test generates a random array  
of a single byte.

* cleanup static member

* Resolve the incorrect scroll_current when delete or close index (#45226)

Resolve the incorrect current scroll for deleted or closed index

* [ML] Extract DataFrameAnalyticsTask into its own class (#46402)

This refactors `DataFrameAnalyticsTask` into its own class.
The task has quite a lot of functionality now and I believe it would
make code more readable to have it live as its own class rather than
an inner class of the start action class.

* Mute CcrRollingUpgradeIT.testUniDirectionalIndexFollowing and testUniDirectionalIndexFollowing (#46429)

Relates #46416

* Mute SSLClientAuthTests.testThatHttpFailsWithoutSslClientAuth()

Tracked in #46230

* Add yet more logging around index creation (#46431)

Further investigation into #46091, expanding on #46363, to add even more
detailed logging around the retry behaviour during index creation.

* [Transform] simplify class structure of indexer (#46306)

simplify transform task and indexer

 - remove redundant transform id
 - moving client data frame indexer (and builder) into a separate file

* [ML] Tolerate total_search_time_ms not mapped in get datafeed stats (#46432)

ML users who upgrade from versions prior to 7.4 to 7.4 or later
will have ML results indices that do not have mappings for the
total_search_time_ms field.  Therefore, when searching these
indices we must tolerate this field not having a mapping.

Fixes #46437

* [DOCS] Adds progress parameter description to the GET stats data frame analytics API doc. (#46434)

* [DOCS] Resort common-parms (#46419)

* [DOCS] Change // CONSOLE comments to [source,console] (#46441)

* [DOCS] Add index alias definition to glossary (#46339)

* [Docs] Fix typo in field-names-field.asciidoc (#46430)

* [DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449)

* [DOCS] Correct definition for `allow_no_indices` parameter (#46450)

* Increase REST-Test Client Timeout to 60s (#46455)

We are seeing requests take longer than the default 30s,
which leads to requests being retried and returning
unexpected failures such as "index already exists",
because the initial requests that timed out had in fact
succeeded on the server.
=> double the timeout to reduce the likelihood of
the failures described in #46091
=> as suggested in the issue, a follow-up should
probably turn off retrying altogether
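
The failure mode described here can be reproduced with a toy model. The Python sketch below is an illustration only: `FakeCluster` and `create_with_retry` are hypothetical names, and the real REST test client is not modelled. It shows why retrying a non-idempotent request after a client-side timeout can surface an "already exists" error even though the first attempt worked.

```python
class FakeCluster:
    """Toy server: index creation always succeeds server-side, even if slow."""

    def __init__(self):
        self.indices = set()

    def create_index(self, name: str, slow: bool = False) -> None:
        if name in self.indices:
            raise ValueError(f"index [{name}] already exists")
        self.indices.add(name)          # the work completes on the server
        if slow:
            # the client stopped waiting before the response arrived
            raise TimeoutError("client timed out waiting for response")


def create_with_retry(cluster: FakeCluster, name: str,
                      slow_attempts: int, retries: int) -> str:
    """Retry on timeout; the first `slow_attempts` attempts are slow."""
    for attempt in range(retries + 1):
        try:
            cluster.create_index(name, slow=attempt < slow_attempts)
            return "created"
        except TimeoutError:
            continue                    # blind retry of a non-idempotent call
    return "timed out"


if __name__ == "__main__":
    try:
        create_with_retry(FakeCluster(), "idx", slow_attempts=1, retries=1)
    except ValueError as err:
        print("retry failed:", err)
```

With retries disabled (`retries=0`) the same slow request reports a timeout instead of the misleading "already exists" error, which is the trade-off the follow-up suggestion points at.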

* [DOCS] Remove cat request from Index Segments API requests (#46463)

* SQL: fix scripting for grouped by datetime functions (#46421)

* Fix issue with painless scripting not being correctly generated when
datetime functions are used for GROUPing of an INTERVAL operation.

* Use `null` schema response for `SYS TABLES` command. (#46386)

* Ignore replication for noop updates (#46458)

Previously, we ignored replication for noop updates because they did not
have sequence numbers. Since #44603, we have been assigning sequence
numbers to noop updates, leading them to be replicated to replicas.

This bug occurs only on 8.0 because it requires #41065 and #44603.

Closes #46366

* Strengthen testUpdate in rolling upgrade

We hit a bug where we can't partially update documents created in a
mixed cluster between 5.x and 6.x. Although this bug does not affect
7.0 or later, we should have a good test that catches this issue.

Relates #46198
@dakrone dakrone closed this as completed in 56aabcd Sep 9, 2019
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 9, 2019
This commit adds retention to the existing Snapshot Lifecycle Management feature (elastic#38461) as described in elastic#43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM retention runs on a schedule configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

It returns statistics about snapshots taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. elastic#45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (elastic#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to elastic#38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (elastic#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to elastic#43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (elastic#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to elastic#43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (elastic#45018)

Semi-related to elastic#44465, this allows the `"retention"` configuration map
to be missing.

Relates to elastic#43663

* Add min_count and max_count as SLM retention predicates (elastic#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets these criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to elastic#43663
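
To make the interaction of the three options concrete, here is a minimal Python sketch of one plausible way to combine them. The precedence used (`max_count` forces deletion of the oldest surplus, `min_count` protects the newest snapshots, `expire_after` applies to the rest) is an assumption for illustration; `Snapshot` and `snapshots_to_delete` are hypothetical names and this is not the actual `SnapshotRetentionConfiguration` code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    name: str
    start_millis: int  # snapshot start time, epoch milliseconds


def snapshots_to_delete(snapshots: List[Snapshot], now_millis: int,
                        expire_after_millis: Optional[int] = None,
                        min_count: Optional[int] = None,
                        max_count: Optional[int] = None) -> List[str]:
    """Return the names of snapshots eligible for deletion, oldest first."""
    ordered = sorted(snapshots, key=lambda s: s.start_millis)  # oldest first
    total = len(ordered)
    doomed = []
    for index, snap in enumerate(ordered):
        # number of snapshots at least as new as this one, itself included
        newer_or_equal = total - index
        # max_count: anything beyond the newest max_count goes, whatever its age
        if max_count is not None and newer_or_equal > max_count:
            doomed.append(snap.name)
            continue
        # min_count: the newest min_count snapshots are always kept
        if min_count is not None and newer_or_equal <= min_count:
            continue
        # expire_after: delete once the snapshot is older than the limit
        if (expire_after_millis is not None
                and now_millis - snap.start_millis > expire_after_millis):
            doomed.append(snap.name)
    return doomed


if __name__ == "__main__":
    DAY = 86_400_000
    now = 100 * DAY
    snaps = [Snapshot(f"s{d}", now - d * DAY) for d in (20, 18, 16, 10, 5, 1)]
    print(snapshots_to_delete(snaps, now, 14 * DAY, min_count=4, max_count=5))
```

Any of the three parameters can be left as `None`, mirroring the statement that one, two, or all three options may be specified.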

* Time-bound deletion of snapshots in retention delete function (elastic#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deleting snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to elastic#43663
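
The best-effort behaviour described above can be sketched as a loop that checks the elapsed time before each deletion, never during one. This Python example is a sketch under that assumption; `delete_with_time_budget` is a hypothetical name, and the injectable `now` callable plays the role that a `LongSupplier` would play in a test, so no real sleeping is needed.

```python
import time
from typing import Callable, Iterable, List, Tuple


def delete_with_time_budget(names: Iterable[str],
                            delete: Callable[[str], None],
                            max_seconds: float,
                            now: Callable[[], float] = time.monotonic
                            ) -> Tuple[List[str], List[str]]:
    """Delete snapshots serially until the time budget is used up.

    A deletion that is already running is never interrupted; once the
    budget has passed, the remaining names are deferred to the next run.
    """
    start = now()
    deleted, deferred = [], []
    for name in names:
        if now() - start >= max_seconds:
            deferred.append(name)   # budget exhausted: leave for the next cycle
            continue
        delete(name)                # may itself run past the budget
        deleted.append(name)
    return deleted, deferred


if __name__ == "__main__":
    clock = [0.0]

    def slow_delete(name: str) -> None:
        clock[0] += 40.0            # pretend each delete takes 40 seconds

    print(delete_with_time_budget(["a", "b", "c"], slow_delete, 60.0,
                                  now=lambda: clock[0]))
```

In the demo the second deletion finishes at the 80 second mark, past the 60 second budget, and only the third is deferred.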

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (elastic#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and persisted in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC support, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to elastic#43663
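
In the example response above, each `total_*` field equals the sum of the matching per-policy counter. The Python sketch below reproduces that arithmetic from the `policy_metrics` map; `summarize` is a hypothetical helper, and whether the real implementation derives the totals this way or tracks them separately is not stated in this commit message.

```python
from typing import Dict

COUNTERS = (
    "snapshots_taken",
    "snapshots_failed",
    "snapshots_deleted",
    "snapshot_deletion_failures",
)


def summarize(policy_metrics: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum each per-policy counter into a 'total_<counter>' field."""
    totals = {"total_" + key: 0 for key in COUNTERS}
    for metrics in policy_metrics.values():
        for key in COUNTERS:
            totals["total_" + key] += metrics.get(key, 0)
    return totals


if __name__ == "__main__":
    example = {
        "daily-snapshots2": {"snapshots_taken": 7, "snapshots_failed": 0,
                             "snapshots_deleted": 6,
                             "snapshot_deletion_failures": 0},
        "daily-snapshots": {"snapshots_taken": 12, "snapshots_failed": 0,
                            "snapshots_deleted": 12,
                            "snapshot_deletion_failures": 6},
    }
    print(summarize(example))
```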

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (elastic#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (elastic#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits up to the maximum allowed deletion time for the
snapshot to complete; however, the waiting time is not counted against
the limit on actual deletions.

Relates to elastic#43663

* Increase timeout waiting for snapshot completion

* Apply patch

From https://github.com/original-brownbear/elasticsearch/commit/2374316f0d1912c9e1498bece195546a1dc60bce.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (elastic#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to elastic#43663

* Check all actions preventing snapshot delete during retention (elastic#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to elastic#43663
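
The four blocking conditions listed above amount to a single guard evaluated before retention deletes anything. A minimal Python sketch follows; the `RepoActivity` type and its field names are hypothetical stand-ins for whatever the cluster state exposes, and the function name only loosely mirrors the `okayToDeleteSnapshots` mentioned in this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoActivity:
    """Hypothetical summary of in-flight operations taken from cluster state."""
    snapshot_running: bool = False
    deletion_running: bool = False
    cleanup_running: bool = False
    restore_running: bool = False


def okay_to_delete_snapshots(activity: RepoActivity) -> bool:
    """Retention may delete only when none of the blocking actions is active."""
    return not (activity.snapshot_running
                or activity.deletion_running
                or activity.cleanup_running
                or activity.restore_running)


if __name__ == "__main__":
    print(okay_to_delete_snapshots(RepoActivity()))
    print(okay_to_delete_snapshots(RepoActivity(restore_running=True)))
```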

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (elastic#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to elastic#43663

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase (elastic#46356)

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to elastic#43663
Resolves elastic#46205

* Add error logging when exceptions are thrown
dakrone added a commit that referenced this issue Sep 10, 2019
* Add retention to Snapshot Lifecycle Management (#46407)

This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM retention runs on a schedule configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

It returns statistics about snapshots taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to #38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to #43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to #43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (#45018)

Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.

Relates to #43663

* Add min_count and max_count as SLM retention predicates (#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets these criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to #43663

* Time-bound deletion of snapshots in retention delete function (#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deleting snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to #43663

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and persisted in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC support, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to #43663

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits up to the maximum allowed deletion time for the
snapshot to complete; however, the waiting time is not counted against
the limit on actual deletions.

Relates to #43663

* Increase timeout waiting for snapshot completion

* Apply patch

From https://github.com/original-brownbear/elasticsearch/commit/2374316f0d1912c9e1498bece195546a1dc60bce.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to #43663

* Check all actions preventing snapshot delete during retention (#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to #43663

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to #43663

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase (#46356)

* Rewrite SnapshotLifecycleIT as an ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown

* Update serialization versions

* Fix type inference

* Use non-Cancellable HLRC return value

* Fix Client mocking in test

* Fix SLMSnapshotBlockingIntegTests for 7.x branch

* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval

* Add serialization guards for SnapshotLifecyclePolicy