
Jenkins is unable to keep the slaves online after they are started #621

Closed
nachevn opened this issue Feb 21, 2018 · 4 comments

Comments


nachevn commented Feb 21, 2018

  • docker-plugin version: 1.1.3
  • Jenkins version: 2.73.1
  • Docker Engine version: 1.9.1 (API version 1.21)

The problem is that the containers are killed shortly after they are started, and the jobs remain stuck in the build queue. Furthermore, after the update from 1.1.2 to 1.1.3, the configurations of all the "killed" containers are put into the ${JENKINS_HOME}/config-history/nodes/ folder as already deleted nodes, e.g.

${JENKINS_HOME}/config-history/nodes # ls -la docker-6
Display all 822 possibilities? (y or n)
docker-634b6da5ae0ff/                             docker-63a1ec0e1867c_deleted_20180221_095242_395/ docker-6406710c7a4c1_deleted_20180221_114754_370/ docker-642d246f02c17_deleted_20180221_124402_735/
docker-63744b3f8984f_deleted_20180221_090932_779/ docker-63a1ec0e1867c_deleted_20180221_095244_376/ docker-640671629b4cd_deleted_20180221_114749_254/ docker-642e0cc028348_deleted_20180221_124632_078/
docker-637c800ca5435_deleted_20180221_091050_098/ docker-63a20f933864d_deleted_20180221_095252_147/
....

These 822 deleted nodes were created within only 3 hours...

Here is also the relevant part of the Jenkins log...

....
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] hudson.slaves.NodeProvisioner$2.run Image of servicelayer/slbuild:latest provisioning successfully completed. We have now 19 computer(s)
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Asked to provision 1 slave(s) for: docker-slbuild
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Will provision 'servicelayer/slbuild:latest', for label: 'docker-slbuild', in cloud: 'docker@cloud'
21-Feb-2018 12:57:13.870 SEVERE [dockerjava-netty-17-8] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:13.875 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.canAddProvisionedSlave Provisioning 'servicelayer/slbuild:latest' number 2 (of 10) on 'docker@cloud'; Total containers: 6 (of 12)
21-Feb-2018 12:57:13.875 INFO [jenkins.util.Timer [#8]] hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply Started provisioning Image of servicelayer/slbuild:latest from docker@cloud with 1 executors. Remaining excess workload: 0.156
21-Feb-2018 12:57:13.875 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Pulling image 'servicelayer/slbuild:latest'. This may take awhile...
21-Feb-2018 12:57:14.471 SEVERE [dockerjava-netty-17-9] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:14.471 SEVERE [dockerjava-netty-18-13] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotModifiedException:
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:97)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:14.535 SEVERE [dockerjava-netty-17-10] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:17.276 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Finished pulling image 'servicelayer/slbuild:latest', took 3401 ms
21-Feb-2018 12:57:17.277 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container for servicelayer/slbuild:latest
21-Feb-2018 12:57:17.277 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container 644325152e671 from image: servicelayer/slbuild:latest
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Finished pulling image 'businesslayer/blbuild:latest', took 3660 ms
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container for businesslayer/blbuild:latest
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container 6443260218731 from image: businesslayer/blbuild:latest
21-Feb-2018 12:57:21.174 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheckSSH.execute SSH port is open on cloud.nohostname.net:47209
21-Feb-2018 12:57:21.759 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheckSSH.execute SSH port is open on cloud.nohostname.net:47210
[02/21/18 12:57:21] SSH Launch of docker-644325152e671 on cloud.nohostname.net failed in 192 ms
[02/21/18 12:57:22] SSH Launch of docker-6443260218731 on cloud.nohostname.net failed in 276 ms
21-Feb-2018 12:57:23.857 INFO [jenkins.util.Timer [#9]] hudson.slaves.NodeProvisioner$2.run Image of businesslayer/blbuild:latest provisioning successfully completed. We have now 17 computer(s)
21-Feb-2018 12:57:23.858 INFO [jenkins.util.Timer [#9]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Asked to provision 1 slave(s) for: docker-blbuild
....

I'm not really sure where the problem is, but after a couple of retries (I don't know how many, but really many, many retries) a container is started and the job is executed. Unfortunately this takes a lot of time and leaves a lot of "deleted" slaves... I would really appreciate any help.


SansGuidon commented Feb 22, 2018

  • docker-plugin version: 1.1.3
  • Jenkins version: 2.89.4
  • Docker Engine version: 1.10.3 (API version 1.22)

Same issue here, after updating from 0.16.2 to 1.1.3. In the last 24 hours, almost 10K containers were killed too promptly. Containers usually start after a few minutes, but the consequence is that jobs stay in the build queue for too long.
-bash-4.1$ find /data/jenkins2/config-history/nodes -mtime -1 -name "docker*" | wc -l
9672

@pjdarton
Member

I think that #623 may be related here.
Where I work, I've now lost count of the number of people who've found that the "idle timeout" is set to zero, causing containers to get deleted before they've had a chance to start work. This is, I suspect, a consequence of a change in the way this data was persisted in the XML (as a result of a refactor several months ago) - I think the code was writing "timeout" but reading "idleMinutes" and this changeover meant that a lot of people ended up with a timeout of zero after upgrading.
Anyone in this thread who wants to try out my theory is welcome to try out the plugin from the CI build of that pull request.
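A minimal, self-contained sketch of that failure mode (hypothetical class name and plain DOM parsing rather than the plugin's real persistence code; the tag names are only assumptions based on the description above) shows how a value saved under the old field name silently falls back to zero when the new code reads a different name:

    // Hypothetical illustration, not the plugin's actual classes.
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class RenamedFieldDemo {
        static int readIntElement(Document doc, String tag, int defaultValue) {
            NodeList nodes = doc.getElementsByTagName(tag);
            if (nodes.getLength() == 0) {
                return defaultValue; // element missing from the old config, so the default wins
            }
            return Integer.parseInt(nodes.item(0).getTextContent().trim());
        }

        public static void main(String[] args) throws Exception {
            // Node config as written by the old plugin version, under the old field name.
            String oldXml = "<template><timeout>10</timeout></template>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(oldXml)));
            // The upgraded code looks for the new field name and never finds it.
            int idleMinutes = readIntElement(doc, "idleMinutes", 0);
            System.out.println("idleMinutes after upgrade = " + idleMinutes); // prints 0
        }
    }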

I've also noticed (during my own testing) that com.github.dockerjava.api.exception.NotFoundException: no such id: gets logged by the docker-java library as a part of normal operation.
i.e. this looks a lot worse than it really is.
Any really juicy exceptions get logged by the docker-plugin itself, so I would suggest that you don't need to pay much attention to these (over-stated) errors flagged up by the docker-java library (and, in any case, it's bad form to log a perfectly normal "not found" exception as well as to throw it, especially to log it as a "severe" exception ... but that's an issue with the docker-java library, not the docker-plugin code).
My guess as to the reason for those is that DockerTransientNode.terminate(...) is being called more than once and the second (and subsequent) times it gets called it'll find that the container doesn't exist. I've seen code in other cloud plugins where there's a lot of defensive coding around that sort of thing happening, so I'd guess that this is a "known issue" with the way Jenkins handles terminating nodes.
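As an illustration of that kind of defensive coding, a short sketch along these lines (the class and method names here are assumed, not the plugin's actual code) would turn a second terminate into a harmless no-op instead of a SEVERE log entry:

    import com.github.dockerjava.api.DockerClient;
    import com.github.dockerjava.api.exception.NotFoundException;

    class ContainerCleanupSketch {
        // Remove the container, tolerating the case where it is already gone
        // because terminate() was called more than once.
        void terminateContainer(DockerClient client, String containerId) {
            try {
                client.removeContainerCmd(containerId).withForce(true).exec();
            } catch (NotFoundException alreadyGone) {
                // The container no longer exists, so there is nothing left to clean up;
                // swallow the exception rather than reporting it as an error.
            }
        }
    }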


nachevn commented Feb 22, 2018

Thanks @pjdarton, your hint helped me a lot! After increasing the idle timeout from 0 to 1, the containers started normally again.

Another question: I'm still not quite sure whether there is a way to limit the number of executors on one docker slave node. Currently, my impression is that only one job can be executed within one started container, which is also my goal. But is this really true?

@pjdarton
Member

One job per container is standard behaviour, yes.
