
meta: PRs with infra failures can land #22439

Closed

Conversation

mcollina
Member

Amend the collaborator guide so that, in case of severe infrastructure
failures, we can still land PRs.

@mcollina mcollina requested a review from Trott August 21, 2018 16:56
@nodejs-github-bot nodejs-github-bot added the doc Issues and PRs related to the documentations. label Aug 21, 2018
@mcollina
Member Author

cc @nodejs/tsc

@mscdex
Contributor

mscdex commented Aug 21, 2018

The current statement seems awfully vague. Does this mean every target platform/configuration crashed? Just one? Something else?

If every one of them crashed, to me it doesn't make sense to land the PR since it wasn't successful at all...

@ofrobots
Contributor

I think @nodejs/build and especially @Trott should review/sign off.

Member

@richardlau left a comment

I'm against this. If the infrastructure is causing issues with landing PRs we should fix the infrastructure, not ignore it.

Contributor

@mmarchini left a comment

The last CI run should be green before landing PRs. If there's an infra issue it should be fixed before landing new PRs.

@mcollina
Member Author

@richardlau @mmarchini I don't think @nodejs/build and @nodejs/tsc are willing to give all collaborators access to the machines, nor do I think all collaborators have the skills to fix them. For example, as a @nodejs/tsc member I do not have access, so I cannot fix this.

Let's consider https://ci.nodejs.org/job/node-test-commit-linux/20945/nodes=centos7-64-gcc6/console and #22439. According to the current rules, I cannot land this. However, it is OK to land.
Because of various issues, I tried to land three things today and landed none, even after checking on the CI status every hour or so. Wasting people's time is not valuable, and it makes the collaborator experience awful.

The SLA that @nodejs/build (being a group of volunteers, like everybody else) can provide does not allow us to be this strict. @nodejs/build are our silent heroes, and this is not their fault (they are awesome), but rather ours for asking for such a high SLA.

The current statement seems awfully vague. Does this mean every target platform/configuration crashed? Just one? Something else?

@mscdex Something between one and five? Definitely not all of them. I'm happy to come up with a better formulation.

@devsnek
Member

devsnek commented Aug 21, 2018

In that case, let's get more people on the Build team.

Strong -1 to encouraging landing on red in any case, even if the PR doesn't cause the red.

@refack
Contributor

refack commented Aug 21, 2018

I'm also -1. I don't think velocity is a trump card; I would favor stability and quality.
But since this is a "values" issue, maybe we should vote. I'd be happy to have a Collaborator-wide vote, but I do respect that the TSC has the mandate to rule in this case.

@refack
Contributor

refack commented Aug 21, 2018

P.S.
In order to reduce wasted time, the Build WG has been prioritizing the stability of the infra. We are also considering ways to improve the UX of our testing cluster.

Will proactive notification of infra failures help?
Will having push notification of CI run results help?

@refack refack added build Issues and PRs related to build files or the CI. meta Issues and PRs related to the general management of the project. labels Aug 21, 2018
@addaleax
Member

Will proactive notification of infra failures help?

I think it would help to know whether CI is in a “usable” state or not?

@mcollina
Member Author

Will proactive notification of infra failures help?

Yes!
I think having a web page/RSS feed might be enough. Something like https://status.npmjs.org/ or https://status.github.com/messages.

Will having push notification of CI run results help?

Yes.

@devsnek
Member

devsnek commented Aug 21, 2018

Shoutout to https://nodejs-ci-health.mmarchini.me/

@mcollina
Member Author

@devsnek that's great work (and we should link it somewhere), but it does not tell me whether those 60% failures are due to spurious tests failing, tests rightfully failing, or infra having trouble. It does not answer the question "is CI usable now?".

@refack
Contributor

refack commented Aug 21, 2018

Will proactive notification of infra failures help?

Yes!

ATM we have a manually updated project board, https://github.com/nodejs/build/projects/1; the first column lists things the Build WG is aware of.
Several WG members are working on multiple ways to automate this (H/T to @joyeecheung, @mmarchini and @maclover7).

Will having push notification of CI run results help?

Yes.

I'll prioritize that task.

Member

@Trott left a comment

I sympathize with the frustration @mcollina is having. Still, I'm a -1 for all the same reasons others have expressed and for the reasons at https://medium.com/@Trott/on-landing-code-when-ci-fails-f3aa999cda3d. Any loophole to landing code without a full CI run isn't something I'm likely to sign off on. That said, here's an alternate solution for @mcollina and others:

One workaround that is similar to what is suggested here, but has a slightly higher bar, is to request that the Build WG remove the broken host, or even the entire platform/OS, from node-test-pull-request. I don't want to document that as the recommended way to do things when you run into CI difficulties, but for the handful of highly-active folks landing multiple pull requests a day: when reasonably certain that a problem is unrelenting and unrelated, it's an option. I slightly prefer it to what is suggested in this PR because it improves things for everyone: CI will be green again. It helps build trust in CI results rather than perpetuating distrust in them. The downside is that it's on the Build WG not to forget to fix the host/platform and get it back online, but they're pretty great about that as far as I know.

@joyeecheung
Member

joyeecheung commented Aug 22, 2018

that's great work (and we should link it somewhere), but it does not tell me whether those 60% failures are due to spurious tests failing, tests rightfully failing, or infra having trouble. It does not answer the question "is CI usable now?".

@mcollina (Shameless plug here) ncu-ci walk pr --stats --copy now generates a markdown summary like the OP in nodejs/reliability#15 (copied to the clipboard; the CLI prints a more succinct summary to the terminal). It aggregates all the failures in the last n CI runs, classifies them into test failures and jenkins/git/build failures, and picks out the ones that fail in more than one PR (that's a pretty reliable indication of CI issues). (--stats has not been released to npm yet, but it's on master.) One flow I've been trying lately is:

  1. Run ncu-ci walk pr --stats --copy --limit 50 every three or more days (a minimal sketch of this flow follows the list below)
  2. Paste the results into https://github.com/nodejs/reliability/issues/ and track the progress of these failures, giving them a little push (e.g. ask around in the build IRC) if necessary
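
To make that flow concrete, here is a minimal sketch in shell (assuming node-core-utils is installed globally from npm; as noted above, --stats is only on master for now):

```sh
# Install node-core-utils, which provides the ncu-ci command.
npm install -g node-core-utils

# Step 1: aggregate the failures from the last 50 node-test-pull-request
# runs, classify them, and copy a markdown summary to the clipboard.
ncu-ci walk pr --stats --copy --limit 50

# Step 2: paste the copied markdown into a new issue at
# https://github.com/nodejs/reliability/issues/ and track the failures there.
```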

The essence of the command is the pattern database in https://github.com/nodejs/node-core-utils/blob/master/lib/ci/ci_failure_parser.js, which I try to improve based on errors I find every time I run the command. It sometimes guesses the root cause wrong, but I think it's usually smarter than me when I look at the logs ;) And as far as aggregation goes, it is good enough.

We can improve it to associate issue links with failures and probably auto-update https://github.com/nodejs/build/projects/1, but it will take some work to make it smart enough to figure out what to do...

Member

@mhdawson left a comment

I'm also -1. I understand the frustration of not being able to land a PR in the time slot you have available, but I feel it is more important to prioritize confidence in what has landed, through green builds, as opposed to allowing things to land when things are broken.

@mcollina
Member Author

Closing this for now; let's see how things go and reassess in the future.

@mcollina mcollina closed this Aug 23, 2018