Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CreateIndexIT.testCreateAndDeleteIndexConcurrently fails #15312

Closed
ywelsch opened this issue Dec 8, 2015 · 4 comments
Closed

CreateIndexIT.testCreateAndDeleteIndexConcurrently fails #15312

ywelsch opened this issue Dec 8, 2015 · 4 comments
Labels
>test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI

Comments

@ywelsch
Copy link
Contributor

ywelsch commented Dec 8, 2015

http://build-us-00.elastic.co/job/es_core_master_centos/8910/testReport/junit/org.elasticsearch.action.admin.indices.create/CreateIndexIT/testCreateAndDeleteIndexConcurrently/

Stacktrace:

UnavailableShardsException[[test][1] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: index {[test][test][AVGADS6IhWB4gLTwZw1z], source[{"index_version":1}]}]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.retryBecauseUnavailable(TransportReplicationAction.java:665)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:381)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$3.onTimeout(TransportReplicationAction.java:522)
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:628)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Interesting log lines:

[2015-12-08 05:25:16,274][WARN ][org.elasticsearch.indices.cluster] [node_t0] [[test][1]] marking and sending shard failed due to [failed to create shard]
org.apache.lucene.store.LockObtainFailedException: Can't lock shard [test][1], timed out after 5000ms
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:556)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:485)
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:236)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:591)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:491)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:526)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:596)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

...

[2015-12-08 05:25:26,295][DEBUG][org.elasticsearch.cluster.routing.allocation] [node_t0] [test][0] ignoring shard failure, unknown allocation id in [test][0], node[w7ZOpTaqRi6UI-q9EjfG2Q], [P], v[1], s[INITIALIZING], a[id=WmeldM5ORfuNr8rWhbHvYQ], unassigned_info[[reason=INDEX_CREATED], at[2015-12-08T05:25:11.268Z]] ([reason=ALLOCATION_FAILED], at[2015-12-08T05:25:26.295Z], details[master {node_t0}{w7ZOpTaqRi6UI-q9EjfG2Q}{local}{local[182]}[mode=>local] marked shard as initializing, but shard is marked as failed, resend shard failure])

...

[2015-12-08 05:25:26,303][DEBUG][org.elasticsearch.gateway] [node_t0] [test][0] found 0 allocations of [test][0], node[null], [P], v[2], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2015-12-08T05:25:26.294Z], details[failed to create shard, failure LockObtainFailedException[Can't lock shard [test][0], timed out after 5000ms]]], highest version: [-1]

Analysis:

What's happening is that index creation fails as locks cannot be obtained for allocating shards (I'm not sure why that is). Shards are marked as failed: This changes UnassignedInfo of the ShardRouting objects by setting reason from INDEX_CREATED to ALLOCATION_FAILED. What this entails is that the PrimaryShardAllocator does not treat these unassigned shards as "new index creation" anymore, but requires that shard data is available before assigning them. Relates to #15241.

@ywelsch ywelsch added >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI labels Dec 8, 2015
@bleskes
Copy link
Contributor

bleskes commented Dec 8, 2015

What's happening is that index creation fails as locks cannot be obtained for allocating shards (I'm not sure why that is).

That's a good question and probably the source of this trouble

What this entails is that the PrimaryShardAllocator does not treat these unassigned shards as "new index creation" anymore, but requires that shard data is available before assigning them.

I think this will be fixed by #15281 or a follow up where we wouldn't need the allocatedPostIndexCreate() flag for primary allocations as the activeAllocaitonId will tell us.

@bleskes
Copy link
Contributor

bleskes commented Dec 8, 2015

and good analysis :)

s1monw added a commit to s1monw/elasticsearch that referenced this issue Jan 8, 2016
The test fails in different ways since we can hit timeouts and shards not being available on retry
which we simply didn't expect before but are valid.

Closes elastic#15312
Closes elastic#14512
@bleskes
Copy link
Contributor

bleskes commented Jan 8, 2016

Closing as a duplicate of #14932 , w.r.t shard locking issue. The primary allocation part has been fixed with #15281

@bleskes bleskes closed this as completed Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

2 participants