
meta: PRs with infra failures can land #22439

Closed

Conversation

mcollina
Member

Amend the collaborator guide so that, in case of severe infrastructure
failures, we can still land PRs.

@mcollina mcollina requested a review from Trott August 21, 2018 16:56
@nodejs-github-bot nodejs-github-bot added the doc Issues and PRs related to the documentations. label Aug 21, 2018
@mcollina
Member Author

cc @nodejs/tsc

@mscdex
Contributor

mscdex commented Aug 21, 2018

The current statement seems awfully vague. Does this mean every target platform/configuration crashed? Just one? Something else?

If every one of them crashed, to me it doesn't make sense to land the PR since it wasn't successful at all...

@ofrobots
Contributor

I think @nodejs/build and especially @Trott should review/sign off.

Member

@richardlau left a comment

I'm against this. If the infrastructure is causing issues with landing PRs we should fix the infrastructure, not ignore it.

Contributor

@mmarchini left a comment

The last CI run should be green before landing PRs. If there's an infra issue it should be fixed before landing new PRs.

@mcollina
Member Author

@richardlau @mmarchini I don't think @nodejs/build and @nodejs/tsc are willing to give all collaborators access to the machines, nor do I think all collaborators have the skills to fix them. For example, as a @nodejs/tsc member I do not have access, so I cannot fix this.

Let's consider https://ci.nodejs.org/job/node-test-commit-linux/20945/nodes=centos7-64-gcc6/console and #22439. According to the current rules, I cannot land this. However, it is OK to land.
Because of various issues, I tried to land three things today and landed none, even after checking on the CI status every hour or so. Wasting people's time is not valuable, and it makes the collaborator experience awful.

The SLA that @nodejs/build (being a group of volunteers, like everybody else) can provide does not allow us to be this strict. @nodejs/build are our silent heroes, and this is not their fault (they are awesome), but rather ours for asking for such a high SLA.

The current statement seems awfully vague. Does this mean every target platform/configuration crashed? Just one? Something else?

@mscdex Something between one and five? Definitely not all of them. I'm happy to come up with a better formulation.

@devsnek
Member

devsnek commented Aug 21, 2018

In that case, let's get more people on the Build team.

Strong -1 to encouraging landing on red in any case, even if the PR doesn't cause the red.

@refack
Contributor

refack commented Aug 21, 2018

I'm also -1. I don't think velocity is a trump card; I would favor stability and quality.
But since this is a "values" issue, maybe we should vote. I'd be happy to have a Collaborator-wide vote, but I do respect that the TSC has the mandate to rule in this case.

@refack
Contributor

refack commented Aug 21, 2018

P.S.
In order to reduce wasted time, the Build WG has been prioritizing the stability of the infra. We are also considering ways to improve the UX of our testing cluster.

Will proactive notification of infra failures help?
Will having push notification of CI run results help?

@refack refack added build Issues and PRs related to build files or the CI. meta Issues and PRs related to the general management of the project. labels Aug 21, 2018
@addaleax
Member

Will proactive notification of infra failures help?

I think it would help to know whether CI is in a “usable” state or not?

@mcollina
Member Author

Will proactive notification of infra failures help?

Yes!
I think having a web page/RSS feed might be enough. Something like https://status.npmjs.org/ or https://status.github.com/messages.

Will having push notification of CI run results help?

Yes.

@devsnek
Member

devsnek commented Aug 21, 2018

Shoutout to https://nodejs-ci-health.mmarchini.me/

@mcollina
Member Author

@devsnek that's great work (and we should link it somewhere), but it does not tell me whether those 60% failures are due to spurious tests failing, tests rightfully failing, or infra having trouble. It does not answer the question "is CI usable now?".

@refack
Contributor

refack commented Aug 21, 2018

Will proactive notification of infra failures help?

Yes!

ATM we have a manually updated project board, https://github.com/nodejs/build/projects/1; the first column lists things the Build WG is aware of.
Several WG members are working on multiple ways to automate this (H/T to @joyeecheung, @mmarchini and @maclover7).

Will having push notification of CI run results help?

Yes.

I'll prioritize that task.

Member

@Trott left a comment

I sympathize with the frustration @mcollina is having. Still, I'm a -1 for all the same reasons others have expressed and for the reasons at https://medium.com/@Trott/on-landing-code-when-ci-fails-f3aa999cda3d. Any loophole to landing code without a full CI run isn't something I'm likely to sign off on. That said, here's an alternate solution for @mcollina and others:

One workaround that is similar to what is suggested here, but has a slightly higher bar, is to request that the Build WG remove the broken host, or even the entire platform/OS, from node-test-pull-request. I don't want to document that as the recommended way to do things when you run into CI difficulties, but for the handful of highly-active folks landing multiple pull requests a day: when reasonably certain that a problem is unrelenting and unrelated, it's an option. I slightly prefer it to what is suggested in this PR because it improves things for everyone: CI will be green again. It helps build trust in CI results rather than perpetuating distrust in them. The downside is that it's on the Build WG not to forget to fix the host/platform and get it back online, but they're pretty great about that as far as I know.

@joyeecheung
Member

joyeecheung commented Aug 22, 2018

that's great work (and we should link it somewhere), but it does not tell me whether those 60% failures are due to spurious tests failing, tests rightfully failing, or infra having trouble. It does not answer the question "is CI usable now?".

@mcollina (Shameless plug here) ncu-ci walk pr --stats --copy now generates a markdown summary like the OP in nodejs/reliability#15 (copied to the clipboard; the CLI prints a more succinct summary to the terminal). It aggregates all the failures in the last n CI runs, classifies them into test failures and jenkins/git/build failures, and picks out the ones that fail in more than one PR (that's a pretty reliable indication of CI issues). (--stats has not been released to npm yet, but it's on master.) One flow I've been trying lately is:

  1. Run ncu-ci walk pr --stats --copy --limit 50 every three or more days (a minimal sketch of this flow follows the list below)
  2. Paste the results into https://github.com/nodejs/reliability/issues/ and track the progress of these failures, giving them a little push (e.g. ask around in the build IRC) if necessary
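
To make that flow concrete, here is a minimal sketch in shell (assuming node-core-utils is installed globally from npm; as noted above, --stats is only on master for now):

```sh
# Install node-core-utils, which provides the ncu-ci command.
npm install -g node-core-utils

# Step 1: aggregate the failures from the last 50 node-test-pull-request
# runs, classify them, and copy a markdown summary to the clipboard.
ncu-ci walk pr --stats --copy --limit 50

# Step 2: paste the copied markdown into a new issue at
# https://github.com/nodejs/reliability/issues/ and track the failures there.
```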

The essence of the command is the pattern database in https://github.com/nodejs/node-core-utils/blob/master/lib/ci/ci_failure_parser.js, which I try to improve based on errors I find every time I run the command. It sometimes guesses the root cause wrong, but I think it's usually smarter than me when I look at the logs ;) And as far as aggregation goes, it is good enough.

We can improve it to associate issue links with failures and probably auto-update https://github.com/nodejs/build/projects/1, but it will take some work to make it smart enough to figure out what to do...

Member

@mhdawson left a comment

I'm also -1. I understand the frustration of not being able to land a PR in the time slot you have available, but I feel it is more important to prioritize confidence in what has landed, through green builds, as opposed to allowing things to land when things are broken.

@mcollina
Member Author

Closing this for now; let's see how things go and reassess in the future.

@mcollina mcollina closed this Aug 23, 2018