
Jenkins is unable to keep the slaves online after they are started #621

Closed
nachevn opened this issue Feb 21, 2018 · 4 comments

Comments


nachevn commented Feb 21, 2018

  • docker-plugin version: 1.1.3
  • Jenkins version: 2.73.1
  • Docker Engine version: 1.9.1 (API version 1.21)

The problem is that the containers are killed shortly after they are started, and the jobs remain stuck in the build queue. Furthermore, after the update from 1.1.2 to 1.1.3, the configurations of all the "killed" containers are put into the ${JENKINS_HOME}/config-history/nodes/ folder as already deleted nodes, e.g.

${JENKINS_HOME}/config-history/nodes # ls -la docker-6
Display all 822 possibilities? (y or n)
docker-634b6da5ae0ff/                             docker-63a1ec0e1867c_deleted_20180221_095242_395/ docker-6406710c7a4c1_deleted_20180221_114754_370/ docker-642d246f02c17_deleted_20180221_124402_735/
docker-63744b3f8984f_deleted_20180221_090932_779/ docker-63a1ec0e1867c_deleted_20180221_095244_376/ docker-640671629b4cd_deleted_20180221_114749_254/ docker-642e0cc028348_deleted_20180221_124632_078/
docker-637c800ca5435_deleted_20180221_091050_098/ docker-63a20f933864d_deleted_20180221_095252_147/
....

These 822 deleted nodes were created within only 3 hours...

Here is also the relevant part of the Jenkins log...

....
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] hudson.slaves.NodeProvisioner$2.run Image of servicelayer/slbuild:latest provisioning successfully completed. We have now 19 computer(s)
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Asked to provision 1 slave(s) for: docker-slbuild
21-Feb-2018 12:57:13.868 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Will provision 'servicelayer/slbuild:latest', for label: 'docker-slbuild', in cloud: 'docker@cloud'
21-Feb-2018 12:57:13.870 SEVERE [dockerjava-netty-17-8] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:13.875 INFO [jenkins.util.Timer [#8]] com.nirima.jenkins.plugins.docker.DockerCloud.canAddProvisionedSlave Provisioning 'servicelayer/slbuild:latest' number 2 (of 10) on 'docker@cloud'; Total containers: 6 (of 12)
21-Feb-2018 12:57:13.875 INFO [jenkins.util.Timer [#8]] hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply Started provisioning Image of servicelayer/slbuild:latest from docker@cloud with 1 executors. Remaining excess workload: 0.156
21-Feb-2018 12:57:13.875 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Pulling image 'servicelayer/slbuild:latest'. This may take awhile...
21-Feb-2018 12:57:14.471 SEVERE [dockerjava-netty-17-9] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:14.471 SEVERE [dockerjava-netty-18-13] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotModifiedException:
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:97)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:14.535 SEVERE [dockerjava-netty-17-10] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
 com.github.dockerjava.api.exception.NotFoundException: no such id: 5a587d46ed57265894bc3103cb72307c124c563c0a04b70aba7363f25eae4e1f

        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
        at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)

21-Feb-2018 12:57:17.276 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Finished pulling image 'servicelayer/slbuild:latest', took 3401 ms
21-Feb-2018 12:57:17.277 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container for servicelayer/slbuild:latest
21-Feb-2018 12:57:17.277 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container 644325152e671 from image: servicelayer/slbuild:latest
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.pullImage Finished pulling image 'businesslayer/blbuild:latest', took 3660 ms
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container for businesslayer/blbuild:latest
21-Feb-2018 12:57:17.525 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode Trying to run container 6443260218731 from image: businesslayer/blbuild:latest
21-Feb-2018 12:57:21.174 INFO [Computer.threadPoolForRemoting [#645]] com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheckSSH.execute SSH port is open on cloud.nohostname.net:47209
21-Feb-2018 12:57:21.759 INFO [Computer.threadPoolForRemoting [#666]] com.nirima.jenkins.plugins.docker.utils.PortUtils$ConnectionCheckSSH.execute SSH port is open on cloud.nohostname.net:47210
[02/21/18 12:57:21] SSH Launch of docker-644325152e671 on cloud.nohostname.net failed in 192 ms
[02/21/18 12:57:22] SSH Launch of docker-6443260218731 on cloud.nohostname.net failed in 276 ms
21-Feb-2018 12:57:23.857 INFO [jenkins.util.Timer [#9]] hudson.slaves.NodeProvisioner$2.run Image of businesslayer/blbuild:latest provisioning successfully completed. We have now 17 computer(s)
21-Feb-2018 12:57:23.858 INFO [jenkins.util.Timer [#9]] com.nirima.jenkins.plugins.docker.DockerCloud.provision Asked to provision 1 slave(s) for: docker-blbuild
....

I'm not really sure where the problem is, but after a couple of retries (I don't know how many, but really many, many retries) a container is started and the job is executed. Unfortunately this takes a lot of time and leaves a lot of "deleted" slaves... I would really appreciate any help.


SansGuidon commented Feb 22, 2018

  • docker-plugin version: 1.1.3
  • Jenkins version: 2.89.4
  • Docker Engine version: 1.10.3 (API version 1.22)

Same issue here, after updating from 0.16.2 to 1.1.3. In the last 24 hours, almost 10K containers were killed too promptly. Containers usually start after a few minutes, but the consequence is that jobs stay in the build queue for too long.
-bash-4.1$ find /data/jenkins2/config-history/nodes -mtime -1 -name "docker*" | wc -l
9672

@pjdarton
Member

I think that #623 may be related here.
Where I work, I've now lost count of the number of people who've found that the "idle timeout" is set to zero, causing containers to get deleted before they've had a chance to start work. This is, I suspect, a consequence of a change in the way this data was persisted in the XML (as a result of a refactor several months ago) - I think the code was writing "timeout" but reading "idleMinutes" and this changeover meant that a lot of people ended up with a timeout of zero after upgrading.
Anyone in this thread who wants to try out my theory is welcome to try out the plugin from the CI build of that pull request.
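A minimal, self-contained sketch of that failure mode (hypothetical class name and plain DOM parsing rather than the plugin's real persistence code; the tag names are only assumptions based on the description above) shows how a value saved under the old field name silently falls back to zero when the new code reads a different name:

    // Hypothetical illustration, not the plugin's actual classes.
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class RenamedFieldDemo {
        static int readIntElement(Document doc, String tag, int defaultValue) {
            NodeList nodes = doc.getElementsByTagName(tag);
            if (nodes.getLength() == 0) {
                return defaultValue; // element missing from the old config, so the default wins
            }
            return Integer.parseInt(nodes.item(0).getTextContent().trim());
        }

        public static void main(String[] args) throws Exception {
            // Node config as written by the old plugin version, under the old field name.
            String oldXml = "<template><timeout>10</timeout></template>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(oldXml)));
            // The upgraded code looks for the new field name and never finds it.
            int idleMinutes = readIntElement(doc, "idleMinutes", 0);
            System.out.println("idleMinutes after upgrade = " + idleMinutes); // prints 0
        }
    }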

I've also noticed (during my own testing) that com.github.dockerjava.api.exception.NotFoundException: no such id: gets logged by the docker-java library as a part of normal operation.
i.e. this looks a lot worse than it really is.
Any really juicy exceptions get logged by the docker-plugin itself, so I would suggest that you don't need to pay much attention to these (over-stated) errors flagged up by the docker-java library (and, in any case, it's bad form to log a perfectly normal "not found" exception as well as to throw it, especially to log it as a "severe" exception ... but that's an issue with the docker-java library, not the docker-plugin code).
My guess as to the reason for those is that DockerTransientNode.terminate(...) is being called more than once and the second (and subsequent) times it gets called it'll find that the container doesn't exist. I've seen code in other cloud plugins where there's a lot of defensive coding around that sort of thing happening, so I'd guess that this is a "known issue" with the way Jenkins handles terminating nodes.
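As an illustration of that kind of defensive coding, a short sketch along these lines (the class and method names here are assumed, not the plugin's actual code) would turn a second terminate into a harmless no-op instead of a SEVERE log entry:

    import com.github.dockerjava.api.DockerClient;
    import com.github.dockerjava.api.exception.NotFoundException;

    class ContainerCleanupSketch {
        // Remove the container, tolerating the case where it is already gone
        // because terminate() was called more than once.
        void terminateContainer(DockerClient client, String containerId) {
            try {
                client.removeContainerCmd(containerId).withForce(true).exec();
            } catch (NotFoundException alreadyGone) {
                // The container no longer exists, so there is nothing left to clean up;
                // swallow the exception rather than reporting it as an error.
            }
        }
    }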


nachevn commented Feb 22, 2018

Thanks @pjdarton, your hint helped me a lot! After increasing the idle timeout from 0 to 1, the containers started normally again.

Another question: I'm still not quite sure whether there is a way to limit the number of executors on one docker slave node. Currently, my impression is that only one job can be executed within one started container, which is also my goal. But is this really true?

@pjdarton
Member

One job per container is standard behaviour, yes.
