ShrinkIndexIT.testCreateShrinkIndexToN fails on Windows #33857
Comments
Pinging @elastic/es-core-infra |
@jasontedor @bleskes any ideas here? I looked but it's really puzzling |
Pinging @elastic/es-distributed |
Relates #30962 |
It's failed again. I'm going to mute it on Windows unless someone says they need to capture more failure logs. |
I've reenabled the test on master (db32781) and added DEBUG logging for |
Unfortunately that was on 5.6 (which does not have the extended logging, and which we forgot to mute). I've muted the test on 5.6 for now (a6aa773). |
Now there's a master failure :) |
|
Similar failures on 5.6, 6.x and master in the last 12 hours.
They seem to be caused by an AccessDeniedException.
|
FYI, the most recent master change built was: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/19/console |
I think this also affects These two build logs contain the extra debugging info: |
I spent some time trying to reproduce this by hand on one of the Windows CI workers. Although I never saw the test fail, I did see tests reporting the failures of recoveries for the reason quoted in the OP:
Here is a copy of the console output, including repeated test runs (of which the stack trace above is the last). This tarball also includes a Process Monitor log from this run, filtered to just the relevant paths: 33857.tar.gz Interesting things to note from the process monitor log are:
I do not yet have a good idea for a next step on this. |
On repeated runs I do see occasional failures as per the OP (maybe 20% of the time, I've not done the stats in detail yet). I tried writing a test case that repeatedly hits the filesystem directly and could not reproduce this
I was suspicious about that |
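For readers who want to try the "hit the filesystem directly" approach themselves, here is a minimal sketch of what such a loop could look like. It is not the test that was actually run for this issue: the class name `RenameStress`, the file names and the iteration count are all made up for illustration, and the use of `ATOMIC_MOVE` is an assumption about how the rename in the stack traces is performed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stress loop: write a temp file, then rename it onto a fixed target.
public class RenameStress {

    /** Returns the number of renames that failed with an IOException. */
    static int run(Path dir, int iterations) throws IOException {
        int failures = 0;
        Path target = dir.resolve("_0.cfs");
        for (int i = 0; i < iterations; i++) {
            // Mimic the "recovery.<id>._0.cfs -> _0.cfs" pattern from the logs.
            Path temp = dir.resolve("recovery." + i + "._0.cfs");
            Files.write(temp, ("payload-" + i).getBytes(StandardCharsets.UTF_8));
            // Remove the previous copy first, as a cleanup step would.
            Files.deleteIfExists(target);
            try {
                // Assumption: the rename is an atomic move.
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (IOException e) {
                failures++;
                Files.deleteIfExists(temp);
            }
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-stress");
        System.out.println("failed renames: " + run(dir, 1000));
    }
}
```

A loop like this never holds the target file open or mapped, which may be why plain filesystem access does not trigger the failure.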
We have more test failures on Windows with the same cause:
Expand for stack trace[2018-09-27T18:57:01,933][WARN ][o.e.c.a.s.ShardStateAction] [node_s0] [index_2][0] received shard failed for shard id [[index_2][0]], allocation id [0Jyeijx5QoeBLzJ85Ym4uw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[index_2][0]: Recovery failed from {node_s1}{UFRWgSweS6uUDivYiTC1-w}{4v-Ak5QCQ62QpeiZyzwK8w}{local}{local[368]} into {node_s0}{Povm-pMfSAaqPh8OcDkjHg}{3UCCMKflQTOz9cF4-uejCA}{local}{local[366]}]; nested: RemoteTransportException[[node_s0][local[366]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [25] files with total size of [36.4kb]]; nested: RemoteTransportException[[node_s1][local[368]][internal:index/shard/recovery/clean_files]]; nested: AccessDeniedException[C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\recovery.AWYcZDdo0i__cHMcxiHN._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\_0.cfs]; ] [...] 
Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\recovery.AWYcZDdo0i__cHMcxiHN._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+5.6+multijob-windows-compatibility\core\build\testrun\integTest\J0\temp\org.elasticsearch.routing.PartitionedRoutingIT_D736F1C7B3CED870-001\tempDir-002\data\nodes\0\indices\qvLxi615Saym8cpW_c-H4Q\0\index\_0.cfs at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?] at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?] at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?] at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:40] at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:40] at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181] at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:297) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39] at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:88) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39] at org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:229) ~[lucene-test-framework-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39] at 
org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:88) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39] at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:319) ~[main/:?] at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:181) ~[main/:?] at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:406) ~[main/:?] at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:486) ~[main/:?] at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[main/:?] at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[main/:?] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[main/:?] at org.elasticsearch.transport.local.LocalTransport$2.doRun(LocalTransport.java:390) ~[main/:?]
Expand for stack traceorg.elasticsearch.indices.recovery.RecoveryFailedException: [index_4][1]: Recovery failed from {node_s0}{7C-fb4R9RK-WS4Jk1WYXuQ}{BijPdN92Q32zPDZ9t_ldiA}{127.0.0.1}{127.0.0.1:55085} into {node_s1}{RqyKMgp3QZ-gu0IVK_U0AA}{I5djfftGTtmtr6bQbOAUJA}{127.0.0.1}{127.0.0.1:55086} 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:624) [main/:?] 1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?] 1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?] 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181] 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181] 1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181] 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:55089][internal:index/shard/recovery/start_recovery] 1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?] 1> ... 5 more 1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [13] files with total size of [19.3kb] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:424) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?] 1> ... 
5 more 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s1][127.0.0.1:55090][internal:index/shard/recovery/clean_files] 1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.x+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_CAE88D19276CD2B2-001\tempDir-002\data\nodes\1\indices\H6TUqJpRTRedJr1z26jubw\1\index\recovery.fshzJJ1CQCelF3mD6b587Q._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.x+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_CAE88D19276CD2B2-001\tempDir-002\data\nodes\1\indices\H6TUqJpRTRedJr1z26jubw\1\index\_0.cfs 1> at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?] 1> at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?] 1> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?] 1> at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?] 
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:16] 1> at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181] 1> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13] 1> at org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:231) ~[lucene-test-framework-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13] 1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13] 1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13] 1> at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:334) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:188) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:452) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:557) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:551) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1350) ~[main/:?]
Expand for stack traceorg.elasticsearch.indices.recovery.RecoveryFailedException: [index_4][3]: Recovery failed from {node_sd1}{O8l8bMuJTv2FfmOwRxXWgg}{1m7utzhqQjWidDUawrikfQ}{127.0.0.1}{127.0.0.1:63944} into {node_sd2}{IBcohRX6RB2qHr4S6wDBUQ}{Fa3f-veRRFOWmrrADglJ9A}{127.0.0.1}{127.0.0.1:63945} 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [main/:?] 1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?] 1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?] 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181] 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181] 1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181] 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd1][127.0.0.1:63944][internal:index/shard/recovery/start_recovery] 1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] 1> ... 5 more 1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [4] files with total size of [4.5kb] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:446) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] 1> ... 
5 more 1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [4] files with total size of [4.5kb] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:446) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] 1> ... 
5 more 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd2][127.0.0.1:63945][internal:index/shard/recovery/clean_files] 1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\2\indices\SI6IoZJ5R_Wb4Nqx06oDmA\3\index\recovery.CB9P345NRmmuu-tPzwjksg._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.routing.PartitionedRoutingIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\2\indices\SI6IoZJ5R_Wb4Nqx06oDmA\3\index\_0.cfs
Expand for stack traceorg.elasticsearch.indices.recovery.RecoveryFailedException: [second_split][0]: Recovery failed from {node_sd2}{hcRrVZwPQ7e8MTakDgz_GQ}{wfXHyRV6RPW0sKTXdamhdA}{127.0.0.1}{127.0.0.1:58177} into {node_sd3}{VlMAJBDSTemm3IJZgPciLA}{zPecWbBiSWaagZwl7Q1l6g}{127.0.0.1}{127.0.0.1:58171} 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [main/:?] 1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [main/:?] 1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?] 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181] 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181] 1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181] 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd2][127.0.0.1:58177][internal:index/shard/recovery/start_recovery] 1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:174) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 
1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] 1> ... 5 more 1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [5] files with total size of [5.6kb] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:446) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] 1> ... 
5 more 1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_sd3][127.0.0.1:58171][internal:index/shard/recovery/clean_files] 1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.action.admin.indices.create.SplitIndexIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\3\indices\xKk-zXSXQsGpeMmq5ysVpA\0\index\recovery._dztPVO1RBKeP7ZpRHIbiA._0.cfs -> C:\Users\jenkins\workspace\elastic+elasticsearch+6.4+multijob-windows-compatibility\server\build\testrun\integTest\J2\temp\org.elasticsearch.action.admin.indices.create.SplitIndexIT_1C4D221BAD62999F-001\tempDir-002\data\nodes\3\indices\xKk-zXSXQsGpeMmq5ysVpA\0\index\_0.cfs 1> at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?] 1> at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?] 1> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301) ~[?:?] 1> at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287) ~[?:?] 
1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47] 1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47] 1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47] 1> at org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147) ~[lucene-test-framework-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:47] 1> at java.nio.file.Files.move(Files.java:1395) ~[?:1.8.0_181] 1> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45] 1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45] 1> at org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45] 1> at org.elasticsearch.index.store.Store.renameTempFilesSafe(Store.java:337) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoveryTarget.renameAllTempFiles(RecoveryTarget.java:188) ~[main/:?] 1> at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:439) ~[main/:?] 1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:556) ~[main/:?] 
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$CleanFilesRequestHandler.messageReceived(PeerRecoveryTargetService.java:550) ~[main/:?] 1> at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?] 1> at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[main/:?] |
Looking harder at the logs, what happens is:
However we failed to delete the unassigned shards from the first node after the rebalance (at least twice, maybe it succeeds, we're still looking). For some reason when we try and allocate a shard on top of the old directory it then fails with the |
Note that we're using |
I have tried to find a way to reproduce this with fewer moving parts, but so far this has failed. The following test passes.
|
One interesting side-question is why this happens more on tests to do with shrinking. I think this is the answer: Lines 726 to 727 in c4b8316
In other cases, I guess an allocation that fails like this will retry (up to 5 times) and subsequent attempts will succeed. Correction: subsequent attempts to allocate a copy of the shrunk index onto the node containing the source index will, I think, all fail, because of #33857 (comment) (unless either the source index is removed or closed, or else all the shared segments are merged away, neither of which seems very likely in this test). |
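The difference between a transient failure and a persistent one can be illustrated with a toy retry loop. This is only a sketch of the reasoning above, not Elasticsearch's allocation code: `AllocationRetry`, `allocate` and the attempt predicate are hypothetical names, and "up to 5 times" is modelled here as five attempts in total.

```java
import java.util.function.IntPredicate;

// Toy model of bounded allocation retries (hypothetical, not Elasticsearch code).
public class AllocationRetry {

    /** Returns the 1-based attempt that succeeded, or -1 if every attempt failed. */
    static int allocate(int maxAttempts, IntPredicate attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.test(i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Transient cause: the first attempt fails, the second succeeds.
        System.out.println(allocate(5, i -> i > 1));   // prints 2
        // Persistent cause (e.g. the file stays in use): every attempt fails.
        System.out.println(allocate(5, i -> false));   // prints -1
    }
}
```

If the cause of the failure does not go away between attempts, the retry budget is simply exhausted, which matches the correction above.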
To test this hypothesis, I tried writing a test that repeatedly allocates a shard to a node and then removes it, hoping to hit an allocation failure, with max retries set to 0:
This passes. I'm mystified. |
Good catch. We should clean these up IMO. |
Ok, I've got it 🎉 The following test passes on Windows (i.e. throws an
More precisely, attempts to delete This only happens when using |
Still experiencing failure today on 6.4 REPRODUCE WITH: gradlew :server:integTest and 6.x 07:44:40 FAILURE 33.3s J2 | PartitionedRoutingIT.testShrinking <<< FAILURES! |
I have muted |
|
We've worked out what's going on here so I've unassigned myself from this issue. Similar failures on Windows should be muted until we work out what to do with it - no need to report them here. |
This is a Windows limitation with MMapDirectory we cannot do anything about -> Closing |
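As a rough, standalone illustration of the limitation, the sketch below memory-maps a file with plain `java.nio` and then tries to rename another file on top of it. It does not use Lucene or `MMapDirectory`, and the class and file names are invented for the example. The expectation, which I have not verified on any particular Windows version, is that the move fails there while the mapping is live, whereas on Linux and macOS it should succeed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: rename a temp file over a file that is still memory-mapped.
public class MappedRename {

    /** Returns null if the rename succeeded, else the simple name of the exception. */
    static String renameOverMappedFile(Path dir) throws IOException {
        Path target = dir.resolve("_0.cfs");
        Path temp = dir.resolve("recovery.tmp._0.cfs");
        Files.write(target, new byte[] {1, 2, 3, 4});
        Files.write(temp, new byte[] {5, 6, 7, 8});

        MappedByteBuffer mapped;
        try (FileChannel channel = FileChannel.open(target, StandardOpenOption.READ)) {
            mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
        // The mapping outlives the channel; it is only released once the
        // buffer itself becomes unreachable and is garbage collected.
        try {
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        } finally {
            // Touch the buffer so it is still in use during the move.
            mapped.get(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mapped-rename");
        String failure = renameOverMappedFile(dir);
        System.out.println(failure == null ? "rename succeeded" : "rename failed: " + failure);
    }
}
```

Because Java offers no supported way to unmap a buffer explicitly, the caller cannot force the mapping to be released before retrying, which is consistent with the conclusion that there is nothing to fix on the Elasticsearch side.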
5.6
and
It also happened on 6.4, 6.x and master
I was not able to reproduce it on a Windows 2012 VM.
There's no evidence of similar failures in the past, nor recent changes to the 5.6
that are obviously related.