Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: clear stalled jobs on POSIX CI hosts #11246

Closed
wants to merge 1 commit into from

Conversation

Trott
Copy link
Member

@Trott Trott commented Feb 8, 2017

Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

build

@Trott Trott added build Issues and PRs related to build files or the CI. wip Issues and PRs that are still a work in progress. labels Feb 8, 2017
@nodejs-github-bot nodejs-github-bot added the build Issues and PRs related to build files or the CI. label Feb 8, 2017
Makefile Outdated
@@ -188,6 +188,9 @@ test/addons/.buildstamp: config.gypi \
# TODO(bnoordhuis) Force rebuild after gyp update.
build-addons: $(NODE_EXE) test/addons/.buildstamp

clear-stalled:
ps awwwx | grep Release/node | grep -v grep | awk '{print $1}' | xargs kill
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about simply using pkill instead? Isn't that available on all *nix platforms in CI?

Copy link
Member Author

@Trott Trott Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not on our AIX hosts.

@mscdex mscdex added the test Issues and PRs related to the tests. label Feb 8, 2017
@Trott Trott force-pushed the terminate-stalled branch 3 times, most recently from e537333 to f402ec6 Compare February 8, 2017 20:26
@Trott Trott force-pushed the terminate-stalled branch from f402ec6 to f4695f4 Compare February 8, 2017 21:16
@Trott
Copy link
Member Author

Trott commented Feb 8, 2017

@Trott
Copy link
Member Author

Trott commented Feb 8, 2017

CI is green. Removing WIP from the title and removing the in progress label.

/cc @nodejs/build for reviews

@Trott Trott removed the wip Issues and PRs that are still a work in progress. label Feb 8, 2017
@Trott Trott changed the title WIP: build: clear stalled jobs on POSIX CI hosts build: clear stalled jobs on POSIX CI hosts Feb 8, 2017
Copy link
Member

@mhdawson mhdawson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this, we've had to clean the AIX machines several times in the last few days so this is great to see.

@Trott
Copy link
Member Author

Trott commented Feb 9, 2017

I'm terrible at grok'ing Makefile stuff. I put this as an order-only prerequisite. If someone knowledgable could indicate whether that's the right thing to do or not in this case, that would be appreciated.

@richardlau
Copy link
Member

The clear-stalled target should be phony and added to .PHONY at the end of the makefile since it does not refer to a file or directory called clear-stalled.

If the target is declared phony, I'm not sure that it matters if it is order-only or not as a prerequisite.

@gibfahn
Copy link
Member

gibfahn commented Feb 9, 2017

Related to nodejs/build#591

As discussed in that issue, I'd rather have something in the test runner that kept track of processes that hadn't been cleaned up at the end of each run, and failed the job if there were any. However I do think this is better than what we currently have (later jobs randomly failing and @nodejs/build members having to clean up machines manually), so I'd be +1 on this for now.

Currently I don't think this is logging what processes it is cleaning up though, which is information I do think we need, so I'd be -1 on this without that.


I'm also not sure under what circumstances processes get left behind, doesn't tools/test.py clean up once the TIMEOUT happens? I assumed it was for orphaned node subprocesses, but I'm not sure.

Makefile Outdated
XARGS = xargs -r
endif
clear-stalled:
ps awwwx | grep Release/node | grep -v grep | awk '{print $$1}' | $(XARGS) kill
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could just add the same line up to the awk above to list the processes that will be killed, i.e.:

ps awwwx | grep Release/node | grep -v grep
ps awwwx | grep Release/node | grep -v grep | awk '{print $$1}' | $(XARGS) kill

@Trott
Copy link
Member Author

Trott commented Feb 9, 2017

I assumed it was for orphaned node subprocesses, but I'm not sure.

That seems right to me. Always seems to be subprocesses from cluster or debug tests.

@Trott Trott force-pushed the terminate-stalled branch from f4695f4 to 08735d5 Compare February 9, 2017 04:51
@Trott
Copy link
Member Author

Trott commented Feb 9, 2017

Implemented changes per @richardlau and @gibfahn.

CI: https://ci.nodejs.org/job/node-test-pull-request/6306/

Copy link
Member

@gibfahn gibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code addition LGTM, no idea about the Makefile syntax

@santigimeno
Copy link
Member

I'd rather have something in the test runner that kept track of processes that hadn't been cleaned up at the end of each run, and failed the job if there were any.

I agree on this. If a test is leaving processes behind I think it should fail.

Currently I don't think this is logging what processes it is cleaning up though, which is information I do think we need, so I'd be -1 on this without that.

I also agree on this.

@gibfahn
Copy link
Member

gibfahn commented Feb 9, 2017

@santigimeno So with the latest update this will now list the processes it's killing, so I'd say this is an improvement over what we currently do. Are you -1 on this landing as a step in the right direction?

If we could fix tools/test.py that would be great, but I'd rather have this in the meantime. Once the tools/test.py fix is in place, we could probably change this to make it fail the build if it finds any processes (as that means the tools/test.py cleanup failed).

@santigimeno
Copy link
Member

So with the latest update this will now list the processes it's killing, so I'd say this is an improvement over what we currently do. Are you -1 on this landing as a step in the right direction?

Sorry, I had overlooked the cat command. I'd rather have this running at the end of the CI run than before so we know when it really happened, but I agree this is an improvement.

@Trott Trott force-pushed the terminate-stalled branch from 08735d5 to 0140171 Compare February 9, 2017 18:10
@Trott
Copy link
Member Author

Trott commented Feb 9, 2017

Updated Makefile per nit from @joshgav.

Re-running CI: https://ci.nodejs.org/job/node-test-pull-request/6317/

@Trott
Copy link
Member Author

Trott commented Feb 9, 2017

@santigimeno wrote:

I'd rather have this running at the end of the CI run than before so we know when it really happened, but I agree this is an improvement.

Queued up for a subsequent PR: Trott@48693fc

Trott added a commit to Trott/io.js that referenced this pull request Feb 10, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: nodejs#11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
@Trott
Copy link
Member Author

Trott commented Feb 10, 2017

Landed in 90ab68b

@Trott Trott closed this Feb 10, 2017
italoacasas pushed a commit that referenced this pull request Feb 13, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: #11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 14, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: nodejs#11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
krydos pushed a commit to krydos/node that referenced this pull request Feb 25, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: nodejs#11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
@jasnell
Copy link
Member

jasnell commented Mar 7, 2017

would need backport PRs to land on v6 and v4

@gibfahn
Copy link
Member

gibfahn commented Jun 17, 2017

I think it's worth backporting this to v6.x-staging. @Trott would you be willing to backport this?

guide

#12158 is a follow-up to this.

Trott added a commit to Trott/io.js that referenced this pull request Jun 18, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: nodejs#11246
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
@Trott
Copy link
Member Author

Trott commented Jun 18, 2017

@gibfahn #13754

gibfahn pushed a commit that referenced this pull request Jun 18, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: #11246
Backport-PR-URL: #13754
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
gibfahn pushed a commit that referenced this pull request Jun 20, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: #11246
Backport-PR-URL: #13754
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
MylesBorins pushed a commit that referenced this pull request Jul 11, 2017
Sometimes, after a cluster or debug test fails, a fixture hangs around
and holds onto a needed port, causing subsequent CI runs to fail. This
adds a command I've been running manually when this occurs. The command
will clear the stalled jobs before a CI run.

PR-URL: #11246
Backport-PR-URL: #13754
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Josh Gavant <josh.gavant@outlook.com>
@MylesBorins MylesBorins mentioned this pull request Jul 18, 2017
@Trott Trott deleted the terminate-stalled branch January 13, 2022 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants