-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix leak of launcher on disconnect #117
Conversation
- afterDisconnect is called asynchronously from the deletion of the Node. Ensure that we continue to process in the event that `slaveComputer.getNode()` returns null, as we still have to ensure that the launcher is torn down. Without this, SSHLauncher could sometimes leak sockets.
@pjdarton Any thoughts on this PR? |
It sounds plausible, it looks like you've done your homework and, at a (very) brief glance, I don't see anything objectionable in the code changes. However, while I can spare time to press the "merge" button, I don't have enough spare time to test this myself (at all - 110% busy on other stuff) .... so, if you want this merged, you're going to have to test it yourself. I agree that, yes, there is "a lot going on" in that code (I don't properly understand it, and there's nothing noteworthy that'll unit-test it). This code pre-dates my involvement with this plugin, and (I believe) pre-dates Jenkins having "cloud" support at all - its complexity results from making Jenkins do cloud-like functionality using statically provisioned slaves ... and so that's what needs testing (as well as testing that your code changes achieve the effect you want too) before we can release this. |
Fair enough! I was hoping that you (or someone) had a lab that was ready-to-go. I'll get on this. |
I have a server room busy running builds - it's not a server room laying idle awaiting vSphere plugin testing :-) |
No worries. I’ll do my best to put it through the wringer and, if possible, add unit tests. It might be best to extract out the bulk of the code contained in afterDisconnect, just to get it into a place where it’s easier to test. |
FYI, I'm still working on this. However, while trying to explore the various idle action code paths, I've run into some questions and a bug. I can't see any practical way that anyone would use idle actions. Am I missing something?
Bug when reconfiguring a Jenkins build node started by the vSphere Cloud plugin My changes have exposed a bug in the handling of reconfiguring a build node that had been started by the vSphere Cloud plugin. When I reconfigure the running build node (ex: when I might configure the idle action), I found that the build node will throw an NullPointer exception when it gets torn down:
It turns out that when you reconfigure a running build node, clicking the I reverted my changes and confirmed that this |
I modified the Anyway, it seems like there are two possible solutions:
or
Exception:
|
AIUI, the "on idle" actions were designed before Jenkins had a cloud API so that folks could manually create VMs in vSphere and manually define vSphere-slave nodes in Jenkins with instructions to revert those VMs to a snapshot or similar in between builds. It's not much of a surprise if it's not possible to change the settings on a dynamically-provisioned slave. |
Alright, then I'm going to file a separate issue about the saving of the slave configuration. It is, indeed, a rabbit hole. |
Note: I see the same behavior in Jenkins 2.60.3 when running locally using With no ability to manually create a vsphere-based agent, I've got no way to test any impact on idle actions. Is there really anything else that I can test? |
Um, I'm not sure why you're not seeing any ability to create a vSphere node; that should be possible; it certainly used to be possible (in fact, at one point, it was possible to manually create the kind of node we dynamically create rather than just the static kind). Try older versions of Jenkins; it's possible that there's a breaking-compatibility change in very recent versions of Jenkins (which should be dealt with as a separate bug) |
Already tried with Jenkins 2.60.3 (via |
Oh. That's, um, disappointing. I wonder WTF broke that... Well, I guess that gets you off the hook then; you can't break what's already impossible. |
If you're happy with the code changes, I'm happy to push the merge button. |
@pjdarton I'm good with the changes! |
I'll file a ticket re: inability to manually create a vsphere build node |
I have found that the vSphere Cloud plugin does not always tear down the underlying launcher when it is tearing down its build slaves. In my configuration, this leads to a leak of SSH sockets.
My environment makes almost exclusive use of run-once vSphere templates using the SSH launch agent, executing ~2000 per day. We've noticed that our Jenkins instance will accumulate ESTABLISHED TCP sockets to SSH services on vSphere hosts at a rate of ~1000 per week. Over time, this can lead to an exhaustion of file handles.
A simple reproduction involves:
Running lsof on against the jenkins process (ex: 12345) shows an accumulation of SSH sockets for build slaves that no longer exist:
Cause
vSphereCloudLauncher.afterDisconnect
would not call through to the underlying launcher'safterDisconnect
whenslaveComputer.getNode()
returned null. This is problematic because SSHLauncher's teardown of the socket relies on itsafterDisconnect
being called.According to the documentation for afterDisconnect:
I found it helpful to implement a NodeListener that would let me know when the node was deleted. In cases where the leak occurs, I would see a
NODE DELETED: mynode1
message, followed by aSlave is null
.Fix
Modify
vSphereCloudLauncher.afterDisconnect
so that it calls through to the launcher'safterDisconnect
even thoughslaveComputer.getNode()
may have returned null.Caveats
vSphereCloudLauncher.afterDisconnect
, and I am unable to confirm that the changes I've made affect those code paths. Almost none of it is in the code path for my environment (our vsphere templates always seem to have an idle action ofNOTHING
).afterDisconnect
will be called multiple times --- I don't know if this is problematic for the idle action code. If it is, then it might be a good idea to move that code outside ofafterDisconnect
..