Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

Closed
9 tasks done
jasontedor opened this issue May 18, 2018 · 29 comments
Labels
  • blocker
  • :Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection)
  • :Distributed Coordination/Network (HTTP and internode communication implementations)
  • :ml (Machine learning)
  • v6.3.0

Comments

@jasontedor
Member

jasontedor commented May 18, 2018

This is a meta-issue to track the work needed to enable smooth upgrades from the default distribution prior to 6.3.0 to the default distribution post 6.3.0. The sub-tasks are:

jasontedor added the blocker, :Distributed Coordination/Network, v6.3.0, :Distributed Coordination/Cluster Coordination, and :ml labels on May 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@elasticmachine
Collaborator

Pinging @elastic/ml-core

droberts195 added a commit to droberts195/elasticsearch that referenced this issue May 21, 2018
This change is to support rolling upgrade from a pre-6.3 default
distribution (i.e. without X-Pack) to a 6.3+ default distribution
(i.e. with X-Pack).

The ML metadata is no longer eagerly added to the cluster state
as soon as the master node has X-Pack available.  Instead, it
is added when the first ML job is created.

As a result all methods that get the ML metadata need to be able
to handle the situation where there is no ML metadata in the
current cluster state.  They do this by behaving as though an
empty ML metadata was present.  This logic is encapsulated by
always asking for the current ML metadata using a static method
on the MlMetadata class.

Relates elastic#30731
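
A minimal sketch of the pattern this commit message describes, for readers following along. It is not the actual `MlMetadata` code: the class name `MlMetadataSketch`, the plain `Map` standing in for the cluster state customs, and the `putJob` helper are all hypothetical. It only shows the idea that readers go through one static accessor that falls back to an empty instance, while the section is written lazily on first job creation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ML section of the cluster state.
final class MlMetadataSketch {
    // Shared immutable instance returned when no ML section exists yet.
    static final MlMetadataSketch EMPTY = new MlMetadataSketch(Collections.emptyMap());
    static final String TYPE = "ml";

    private final Map<String, String> jobs;

    MlMetadataSketch(Map<String, String> jobs) {
        this.jobs = Collections.unmodifiableMap(new HashMap<>(jobs));
    }

    Map<String, String> jobs() {
        return jobs;
    }

    // Single entry point for readers: never returns null, never mutates the state.
    static MlMetadataSketch getMlMetadata(Map<String, Object> clusterStateCustoms) {
        Object custom = clusterStateCustoms.get(TYPE);
        return custom instanceof MlMetadataSketch ? (MlMetadataSketch) custom : EMPTY;
    }

    // Writers install the section lazily, when the first job is created.
    static void putJob(Map<String, Object> clusterStateCustoms, String jobId, String config) {
        Map<String, String> updated = new HashMap<>(getMlMetadata(clusterStateCustoms).jobs());
        updated.put(jobId, config);
        clusterStateCustoms.put(TYPE, new MlMetadataSketch(updated));
    }
}
```

Under these assumptions, a cluster that never creates an ML job never carries the custom section, so a master without X-Pack has nothing unknown to deserialize.
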
@bleskes
Contributor

bleskes commented May 21, 2018

Thanks for putting this list up. I think we should also deal with the TokenMetaData injected here when called from here.

droberts195 added a commit that referenced this issue May 21, 2018
This change is to support rolling upgrade from a pre-6.3 default
distribution (i.e. without X-Pack) to a 6.3+ default distribution
(i.e. with X-Pack).

The ML metadata is no longer eagerly added to the cluster state
as soon as the master node has X-Pack available.  Instead, it
is added when the first ML job is created.

As a result all methods that get the ML metadata need to be able
to handle the situation where there is no ML metadata in the
current cluster state.  They do this by behaving as though an
empty ML metadata was present.  This logic is encapsulated by
always asking for the current ML metadata using a static method
on the MlMetadata class.

Relates #30731
@bleskes
Contributor

bleskes commented May 21, 2018

I think we should also deal with the TokenMetaData injected here when called from here.

It looks like @ywelsch took care of it in https://github.com/elastic/elasticsearch/pull/30743/files

@nik9000
Member

nik9000 commented May 23, 2018

Now that #30743 is merged I wanted to test this. The 6.3 branch works perfectly for me. The 6.x branch is failing though. That probably isn't a 6.3 release blocker but it is weird. The failure comes during the 5.6.10 upgrade to 6.x. The failure is:

[2018-05-23T12:59:33,101][ERROR][o.e.x.w.s.WatcherIndexTemplateRegistry] [node-0] Error adding watcher template [.watch-history-8]
org.elasticsearch.transport.RemoteTransportException: [node-2][127.0.0.1:41731][indices:admin/template/put]
Caused by: java.lang.IllegalArgumentException: unknown setting [index.xpack.watcher.template.version] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:293) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:246) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:80) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:42) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]

The test that actually fails is "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Test old multi type stuff}", but it only fails because one of its actions times out while the cluster is busy retrying the template creation above over and over again.

@ywelsch
Contributor

ywelsch commented May 23, 2018

@nik9000 This is a real problem (the x-pack node tries to add a template with x-pack only settings, and the OSS master rejects it). I'm not sure why it's not triggered by the test on the 6.3 branch as the same templating behavior exists there (for Watcher, Security etc.) as well. I consider this a blocker for 6.3 and will work on a solution tomorrow.

@nik9000
Member

nik9000 commented May 23, 2018

I consider this a blocker for 6.3 and will work on a solution tomorrow.

❤️

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this. I didn't want to have to use watcher in these tests because it has so much state, but I suspect I have no choice here.

@jasontedor
Member Author

I consider this a blocker for 6.3 and will work on a solution tomorrow.

+1, this is a blocker.

@ywelsch
Contributor

ywelsch commented May 24, 2018

The issue with Watcher is that it uses a custom setting in its template. I've gone through the other XPack services to check if they present the same issue:

  • SecurityIndexManager adds templates (security-index-template.json) but no custom settings in template
  • IndexAuditTrail adds templates (security_audit_log.json) but no custom settings in template
  • LocalExporter adds templates (monitoring-*.json) but no custom settings. It also adds an ingest pipeline, but again no customs.
  • WatcherIndexTemplateRegistry adds templates (triggered-watches.json, watch-history.json, watches.json). Only watch-history uses a custom setting (xpack.watcher.template.version).

I'll explore getting rid of the xpack.watcher.template.version setting, and using the same approach as has been used for the other templates (e.g. security-index-template or security_audit_log.json) where there's a custom _meta field in the mapping.
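
To illustrate the difference between the two approaches, here is a rough sketch, not the real Elasticsearch validation code. The class `TemplateValidationSketch`, the short list of registered settings, and the `template-version` key under `_meta` are all assumptions made for the example. The point is only that a settings key is checked against what the master has registered, while `_meta` in a mapping is free-form and is not validated as a setting.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TemplateValidationSketch {
    // Illustrative subset of settings a master without X-Pack would know about.
    static final Set<String> REGISTERED = new HashSet<>(
            Arrays.asList("index.number_of_shards", "index.number_of_replicas"));

    // Mimics strict settings validation: any unregistered key is rejected.
    static void validate(Map<String, String> templateSettings) {
        for (String key : templateSettings.keySet()) {
            if (!REGISTERED.contains(key)) {
                throw new IllegalArgumentException("unknown setting [" + key + "]");
            }
        }
    }

    // Reads a version stored under the mapping's _meta object instead of a setting.
    // The key name "template-version" is hypothetical.
    static Integer versionFromMeta(Map<String, Object> mapping) {
        Object meta = mapping.get("_meta");
        if (meta instanceof Map) {
            Object version = ((Map<?, ?>) meta).get("template-version");
            return version instanceof Integer ? (Integer) version : null;
        }
        return null;
    }
}
```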

@ywelsch
Contributor

ywelsch commented May 24, 2018

I've opened #30832 for the watcher issue.

@ywelsch
Contributor

ywelsch commented May 24, 2018

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

@nik9000
Member

nik9000 commented May 24, 2018

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

I figured as much. Earlier I'd said:

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this.

and that is still my plan. I got distracted by other things and didn't end up writing the test.

nik9000 added a commit to nik9000/elasticsearch that referenced this issue May 24, 2018
Adds a test that we create the appropriate x-pack templates during the
rolling restart from the pre-6.2 OSS-zip distribution to the new zip
distribution that contains xpack. This is one way to answer the question
"is xpack acting sanely during the rolling upgrade and after it?" It
isn't as good as fully exercising xpack but it is fairly simple and would
have caught elastic#30832.

Relates to elastic#30731
@nik9000
Member

nik9000 commented May 24, 2018

So in my grand tradition of finding things, I believe the following is flaky:

git checkout 6.x
while ./gradlew -p qa/rolling-upgrade/ check -Dtests.distribution=zip; do echo ok; done

On my desktop about half of those runs fail with:

[2018-05-24T16:51:55,044][INFO ][o.e.c.s.MasterService    ] [node-0] zen-disco-elected-as-master ([2] nodes joined)[, ], reason: new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}
[2018-05-24T16:51:55,060][INFO ][o.e.c.s.ClusterApplierService] [node-0] new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true} committed version [751] source [zen-disco-elected-as-master ([2] nodes joined)[, ]]])
[2018-05-24T16:51:55,085][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] zen-disco-failed-to-publish, current nodes: nodes: 
   {node-0}{GsF5lxmmSQiemBzLM6Csbw}{lu_kNBYlROqRavg-7XxHSg}{127.0.0.1}{127.0.0.1:40615}{testattr=test, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, local, master
   {node-2}{ZOcWRGW1QNaBzGNXwigiBw}{xiHQaAk_ToS9Y52SnpY4Ww}{127.0.0.1}{127.0.0.1:42023}{testattr=test, gen=old, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

[2018-05-24T16:51:55,085][WARN ][o.e.c.s.MasterService    ] [node-0] failing [maybe generate license for cluster]: failed to commit cluster state version [752]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
[2018-05-24T16:51:55,086][ERROR][o.e.l.StartupSelfGeneratedLicenseTask] [node-0] unexpected failure during [maybe generate license for cluster]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

Which looks pretty incriminating. There are three available, master-eligible nodes. It just logged them.

But this is 6.x testing the upgrade from 6.3. I've seen no trouble going from 5.6 to 6.x or 6.2 going to 6.x. Go figure.
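
One possible reading of the "[1] left" in the exception above, offered only as a sketch: it assumes the test cluster sets discovery.zen.minimum_master_nodes to the conventional majority and that the publishing master counts itself toward that quorum. With three master-eligible nodes the majority is two, so the master would need one more ack from another master-eligible node. The class and method names are hypothetical.

```java
final class QuorumSketch {
    // Conventional majority of master-eligible nodes.
    static int majority(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    // Acks still needed from other master-eligible nodes, assuming the
    // publishing master already counts as one of the required masters.
    static int acksNeededFromOthers(int minimumMasterNodes) {
        return Math.max(0, minimumMasterNodes - 1);
    }
}
```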

@nik9000
Member

nik9000 commented May 24, 2018

The other nodes see NotMasterException, but I think they are deserializing that from the master, which doesn't think it is a master.

@ywelsch
Contributor

ywelsch commented May 25, 2018

@nik9000 With my fix in #30859, I have this consistently passing now.

@ywelsch
Contributor

ywelsch commented May 25, 2018

The fix only helps in case of a rolling restart, so the mixed-cluster tests still fail with this fix on 6.x. I will have to discuss with @elastic/es-security what can be done about the mixed-cluster situation.

nik9000 added a commit that referenced this issue May 25, 2018
Adds a test that we create the appropriate x-pack templates during the
rolling restart from the pre-6.2 OSS-zip distribution to the new zip
distribution that contains xpack. This is one way to answer the question
"is xpack acting sanely during the rolling upgrade and after it?" It
isn't as good as fully exercising xpack but it is fairly simple and would
have caught #30832.

Relates to #30731
@ywelsch
Contributor

ywelsch commented May 30, 2018

Reminder to self: We also need to fix PersistentTaskParams so that it knows about the versions etc. Assume for example a mixed 6.2 / 6.3 x-pack cluster. If you start a rollup task, this will be put as a persistent task into the cluster state. A 6.2 x-pack node cannot deserialize this task.

ywelsch added a commit that referenced this issue Jun 1, 2018
With the default distribution changing in 6.3, clusters might now contain custom metadata that a
pure OSS transport client cannot deserialize. As this can break transport clients when accessing
the cluster state or reroute APIs, we've decided to exclude any custom metadata that the transport
client might not be able to deserialize. This will ensure compatibility between a < 6.3 transport
client and a 6.3 default distribution cluster. Note that this PR only covers interoperability with older
clients; a follow-up PR will cover full interoperability for >= 6.3 transport clients, where we will
make it possible again to get the custom metadata from the cluster state.

Relates to #30731
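
A rough sketch of the filtering idea in this commit message, under stated assumptions. It is not the code from the PR: the class name, the boolean flag describing the receiver, and the allow-list of custom metadata types are all illustrative. It shows one way a server could drop custom sections that a pure OSS transport client is not expected to be able to read.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CustomMetadataFilterSketch {
    // Illustrative allow-list of custom types assumed readable by any OSS client.
    static final Set<String> READABLE_BY_OSS_CLIENT =
            new HashSet<>(Arrays.asList("repositories", "ingest"));

    // Returns the customs that should be written to the receiver.
    static Map<String, Object> customsToSend(Map<String, Object> customs,
                                             boolean receiverIsOssTransportClient) {
        if (!receiverIsOssTransportClient) {
            return customs; // nodes in the cluster get everything
        }
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : customs.entrySet()) {
            if (READABLE_BY_OSS_CLIENT.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }
}
```
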
@nik9000
Member

nik9000 commented Jun 1, 2018

add more tests (e.g. that rollups cannot be created in a mixed OSS/X-Pack cluster. Mixed-cluster X-pack tests)

I'll take this.

@ywelsch ywelsch reopened this Jun 1, 2018
bleskes added a commit that referenced this issue Jun 3, 2018
With #31020 we introduced the ability for transport clients to indicate what features they support
in order to make sure we don't serialize objects to them that they don't support. This PR adapts the
serialization logic of persistent tasks to be aware of those features and not serialize tasks that
aren't supported.

Also, a version check is added for the future where we may add new task implementations and
need to be able to indicate they shouldn't be serialized both to nodes and clients.

As the implementation relies on the interface of `PersistentTaskParams`, these are no longer
optional. That's acceptable as all current implementations have them and we plan to make
`PersistentTaskParams` more central in the future.

Relates to #30731
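
The two checks described in this commit message could be sketched roughly as follows. This is an illustration under assumptions, not the actual change: `TaskParams`, `Receiver`, the integer version ids, and the feature name used in the usage note are hypothetical stand-ins. The version check applies to every receiver, and the feature check is applied only to transport clients, which is how the message reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class TaskSerializationSketch {

    // Hypothetical view of what each task's params would need to expose.
    interface TaskParams {
        String name();
        int minimalSupportedVersion();       // illustrative integer version id
        Optional<String> requiredFeature();  // empty: no optional feature needed
    }

    // Hypothetical description of the other end of the stream.
    static final class Receiver {
        final int version;
        final boolean transportClient;
        final Set<String> features;

        Receiver(int version, boolean transportClient, String... features) {
            this.version = version;
            this.transportClient = transportClient;
            this.features = new HashSet<>(Arrays.asList(features));
        }
    }

    // Small factory so the example can build params without extra classes.
    static TaskParams params(String name, int minVersion, String featureOrNull) {
        return new TaskParams() {
            public String name() { return name; }
            public int minimalSupportedVersion() { return minVersion; }
            public Optional<String> requiredFeature() { return Optional.ofNullable(featureOrNull); }
        };
    }

    static boolean shouldSerialize(Receiver receiver, TaskParams params) {
        // Version check: applies to nodes and clients alike.
        if (receiver.version < params.minimalSupportedVersion()) {
            return false;
        }
        // Feature check: only clients declare the features they support.
        if (receiver.transportClient && params.requiredFeature().isPresent()) {
            return receiver.features.contains(params.requiredFeature().get());
        }
        return true;
    }

    static List<String> visibleTaskNames(Receiver receiver, List<TaskParams> tasks) {
        List<String> names = new ArrayList<>();
        for (TaskParams task : tasks) {
            if (shouldSerialize(receiver, task)) {
                names.add(task.name());
            }
        }
        return names;
    }
}
```

With this shape, a task that requires a hypothetical "x-pack" feature and a 6.3 version id would be skipped for a 6.2 node and for a client that did not declare the feature, and written for everyone else.
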
@danielmitterdorfer
Member

We have another test failure that I believe belongs here as well (there are several tests failing but they appear to do so for the same reason).

The first one of them has the reproduction line:

./gradlew :qa:rolling-upgrade:v6.3.0-SNAPSHOT#twoThirdsUpgradedTestRunner -Dtests.seed=F070E6B561B268B2 -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade}" -Dtests.security.manager=true -Dtests.locale=nl-BE -Dtests.timezone=America/Cuiaba -Dtests.distribution=zip -Dtests.rest.suite=mixed_cluster
07:23:15 ERROR   67.1s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} <<< FAILURES!
07:23:15    > Throwable #1: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15    > {
07:23:15    >   "name" : "node-0",
07:23:15    >   "cluster_name" : "rolling-upgrade",
07:23:15    >   "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15    >   "version" : {
07:23:15    >     "number" : "6.4.0",
07:23:15    >     "build_flavor" : "default",
07:23:15    >     "build_type" : "zip",
07:23:15    >     "build_hash" : "1eede11",
07:23:15    >     "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15    >     "build_snapshot" : true,
07:23:15    >     "lucene_version" : "7.4.0",
07:23:15    >     "minimum_wire_compatibility_version" : "5.6.0",
07:23:15    >     "minimum_index_compatibility_version" : "5.0.0"
07:23:15    >   },
07:23:15    >   "tagline" : "You Know, for Search"
07:23:15    > }
07:23:15    > 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:821)
07:23:15    > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:182)
07:23:15    > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
07:23:15    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.readVersionsFromInfo(ESClientYamlSuiteTestCase.java:282)
07:23:15    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.initAndResetContext(ESClientYamlSuiteTestCase.java:106)
07:23:15    > 	at java.lang.Thread.run(Thread.java:748)
07:23:15    > Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15    > {
07:23:15    >   "name" : "node-0",
07:23:15    >   "cluster_name" : "rolling-upgrade",
07:23:15    >   "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15    >   "version" : {
07:23:15    >     "number" : "6.4.0",
07:23:15    >     "build_flavor" : "default",
07:23:15    >     "build_type" : "zip",
07:23:15    >     "build_hash" : "1eede11",
07:23:15    >     "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15    >     "build_snapshot" : true,
07:23:15    >     "lucene_version" : "7.4.0",
07:23:15    >     "minimum_wire_compatibility_version" : "5.6.0",
07:23:15    >     "minimum_index_compatibility_version" : "5.0.0"
07:23:15    >   },
07:23:15    >   "tagline" : "You Know, for Search"
07:23:15    > }
07:23:15    > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:495)
07:23:15    > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:484)
07:23:15    > 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
07:23:15    > 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
07:23:15    > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
07:23:15    > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
07:23:15    > 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
07:23:15    > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
07:23:15    > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
07:23:15    > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
07:23:15    > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
07:23:15    > 	... 1 more
07:23:15    > Throwable #2: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=15ms, time_in_queue_millis=15, source=zen-disco-elected-as-master ([2] nodes joined), executing=true, priority=URGENT, insert_order=185}
07:23:15    > {time_in_queue=1ms, time_in_queue_millis=1, source=install-token-metadata, executing=false, priority=URGENT, insert_order=186}
07:23:15    > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=187}
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:767)
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.waitForClusterStateUpdatesToFinish(ESRestTestCase.java:338)
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:151)
07:23:15    > 	at java.lang.Thread.run(Thread.java:748)
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=86ms, time_in_queue_millis=86, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=71ms, time_in_queue_millis=71, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=4ms, time_in_queue_millis=4, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=96ms, time_in_queue_millis=96, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=81ms, time_in_queue_millis=81, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=14ms, time_in_queue_millis=14, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=102ms, time_in_queue_millis=102, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=86ms, time_in_queue_millis=86, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=20ms, time_in_queue_millis=20, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=109ms, time_in_queue_millis=109, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=94ms, time_in_queue_millis=94, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=27ms, time_in_queue_millis=27, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=106ms, time_in_queue_millis=106, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=40ms, time_in_queue_millis=40, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=49}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more    
    

Full cluster logs are available in rolling-upgrade-cluster-logs.zip

bleskes added a commit that referenced this issue Jun 4, 2018
With #31020 we introduced the ability for transport clients to indicate which features they support,
so that we don't serialize objects to them that they don't support. This PR adapts the
serialization logic of persistent tasks to be aware of those features and not serialize tasks that
aren't supported.

Also, a version check is added for the future, where we may add new task implementations and
need to be able to indicate that they shouldn't be serialized to either nodes or clients.

As the implementation relies on the interface of `PersistentTaskParams`, these are no longer
optional. That's acceptable as all current implementations have them and we plan to make
`PersistentTaskParams` more central in the future.

Relates to #30731
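
To make the idea in the commit message above concrete, here is a minimal sketch of feature- and version-aware filtering during serialization. It is not the Elasticsearch implementation: `TaskParams`, `Receiver`, `shouldSerialize`, and the integer version ids are all hypothetical names invented for illustration. The only assumption taken from the commit message is that a task is skipped when the receiving node or client is too old or does not advertise the feature the task needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TaskSerializationSketch {

    /** Hypothetical stand-in for a persistent task's params (not the real PersistentTaskParams). */
    public static final class TaskParams {
        final String name;
        final int minimalSupportedVersion; // assumed integer version id, e.g. 60300 for 6.3.0
        final String requiredFeature;      // null means no special feature is needed

        public TaskParams(String name, int minimalSupportedVersion, String requiredFeature) {
            this.name = name;
            this.minimalSupportedVersion = minimalSupportedVersion;
            this.requiredFeature = requiredFeature;
        }
    }

    /** Hypothetical description of whoever is on the other end of the stream. */
    public static final class Receiver {
        final int version;
        final Set<String> features;

        public Receiver(int version, Set<String> features) {
            this.version = version;
            this.features = features;
        }
    }

    /** A task is written only if the receiver is new enough and has the required feature. */
    public static boolean shouldSerialize(TaskParams params, Receiver receiver) {
        if (receiver.version < params.minimalSupportedVersion) {
            return false; // version check: receiver predates this task type
        }
        return params.requiredFeature == null || receiver.features.contains(params.requiredFeature);
    }

    /** Returns the names of the tasks that would be written to the given receiver. */
    public static List<String> serializableTaskNames(List<TaskParams> tasks, Receiver receiver) {
        List<String> names = new ArrayList<>();
        for (TaskParams task : tasks) {
            if (shouldSerialize(task, receiver)) {
                names.add(task.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<TaskParams> tasks = List.of(
            new TaskParams("core-task", 60000, null),
            new TaskParams("xpack-task", 60300, "xpack"));

        Receiver oldNode = new Receiver(60204, Set.of());
        Receiver plainClient = new Receiver(60300, Set.of());
        Receiver xpackClient = new Receiver(60300, Set.of("xpack"));

        System.out.println(serializableTaskNames(tasks, oldNode));
        System.out.println(serializableTaskNames(tasks, plainClient));
        System.out.println(serializableTaskNames(tasks, xpackClient));
    }
}
```

In this sketch the filter sits on the writing side, so a receiver that cannot deserialize a task type never sees it.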
@nik9000
Member

nik9000 commented Jun 4, 2018

I had a look at @danielmitterdorfer's failure. A few things:

  1. This is an xpack to xpack upgrade from 6.3 to 6.x
  2. I read the logs to say that:
    a. Everything goes fine until we upgrade the second node.
    b. Once the second upgraded node comes online, the first upgraded node is elected master.
    c. It starts doing housekeeping like upgrading templates.
    d. It can't sync the cluster state.

I'd expect there to be a failure somewhere in the log describing how the cluster state sync failed, but I can't find one. All of the exceptions have to do with the restarts and with the cluster not having a valid master after the incident.

@ywelsch
Contributor

ywelsch commented Jun 5, 2018

@nik9000 @danielmitterdorfer this will be fixed by #30859. It doesn't block the 6.3 release, but it does block the 6.4 release.

ywelsch added a commit that referenced this issue Jun 5, 2018
Allows rolling restart from 6.3 to 6.4.

Relates to #30731 and #30251
@nik9000
Member

nik9000 commented Jun 5, 2018

I've opened #31112 to make the x-pack upgrade tests (all three of them) use three nodes. It isn't perfect, but it is about as complex as I'd like to get and still backport to 6.3.

@nik9000
Member

nik9000 commented Jun 7, 2018

So I merged #31112 to master and 6.x yesterday, but that caused all kinds of problems. I'm trying to un-break them now. I'll merge to 6.3 once things are calmer in the branches I've already merged to.

@nik9000
Member

nik9000 commented Jun 8, 2018

I've finished backporting #31112 to 6.3. We can see how it does over the weekend.

#31195 is still open to enable one of the tests after the backport, but it is an upgrade test from 5.6.10 to 6.3, so I think we're fairly OK. It is almost certainly a test bug.

@nik9000
Member

nik9000 commented Jun 11, 2018

The weekend went well as far as the backwards compatibility builds go! I'm happy to say that the upgrades looked great. I think I'm done here.

@jasontedor
Member Author

Thank you to all who contributed here; this was a great effort all around.

Closed by the hard work of a lot of people


6 participants