Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A strategy for evolving the pkgs/by-name CI checks #256788

Closed
infinisil opened this issue Sep 23, 2023 · 10 comments
Closed

A strategy for evolving the pkgs/by-name CI checks #256788

infinisil opened this issue Sep 23, 2023 · 10 comments
Milestone

Comments

@infinisil
Copy link
Member

infinisil commented Sep 23, 2023

Context

The pkgs/by-name CI check gets the nixpkgs-check-by-name tooling, which lives in Nixpkgs itself, from the latest NixOS channel. This makes CI very fast and predictable for all PRs, because it's able to re-use the pre-built tooling from Hydra.

Problem

However, we have a problem if we want to change the tool:
Say we increase the strictness of the tool with a PR, while fixing all the problems it newly detects in the same PR. But now we have to wait for perhaps days until the NixOS channel updates until the new tool is used in CI. In that timespan, new problems could've been introduced without being detected.

Proposed solution

To solve this I propose to temporarily adjust CI for every strictness increase in the tool as follows:

  • In addition to the latest NixOS channel version of the tool, also use a version that is pinned to the then-latest NixOS channel revision at the time of the tool change
  • The two versions of the tool are then used to determine whether the check should succeed or not as follows:
    • If the latest tool succeeds on the base branch of the PR, it must also succeed for the PR.

      This indicates that the pinned tooling isn't necessary anymore, a message is traced that the pin can get removed

    • Otherwise, if the pinned tool succeeds on the base branch, it must also succeed for the PR.

      This indicates that the base branch needs to be fixed for the new tooling. The logs will contain the failures of the latest tool.

    • Otherwise, either the pinned or the latest tool must succeed for the PR

      This indicates that the base branch is broken, either due to checks of a PR being ignored, or the PR being merged after the checks have changed.

      In this state we don't know whether the base branch already succeeded with the latest tool, so a PR can pass if it fixes the breakage using either version.

Once the channel updates the next time after the tooling update is merged, another PR can be made to fix any remaining problems. Repeat for some time until no new problems are introduced from PR's that were merged without running CI again.

This could also benefit from some automation to rerun PR checks if it's been say 1 week, which could then be used as the time window to be sure of no PR's still having old checks.

And in a final PR, once the base branch is definitely working with the new check, the temporary pinned version can be disabled again, only using the latest NixOS channel version of the tool once more.


Having thought through this, I think it's about as smooth as it can get, and it sounds generally useful for all CI changes.

I'd love to hear if there are other proposals to handle this though. In the end I think we need something like this for RFC 140, because we'll have a lot of PR's affected.

@infinisil
Copy link
Member Author

infinisil commented Sep 27, 2023

It was pointed out that the above text is really hard to digest. So here's my attempt at a more visual explanation.

With only a single check

When there's only one check, Nixpkgs can only be in two states, pretty simple:

Copy of Untitled Diagram drawio

Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:

stateDiagram-v2
    invalid: Check fails
    old: Check succeeds

    invalid --> old
    old --> old
Loading

We can only get back to the "Check fails" state if a PR gets merged even if CI fails.
And once there, only a PR that fixes the check will pass CI.

With a new stricter check

With a new stricter check, Nixpkgs can now be in three states:

Untitled Diagram drawio(3)

Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:

stateDiagram-v2
    invalid: Both checks fail
    old: Old check succeeds
    new: New check succeeds

    invalid --> old
    invalid --> new
    old --> old
    new --> new
    old --> new
Loading

Noteworthy here is that:

  • If a CI check runs for a PR where the base branch already passes the stricter check, the PR must also pass that check.
  • It's only possible to go from "New check succeeds" to "Old check succeeds" from merging a PR that wasn't updated since the new stricter check was introduced
  • To go from "Old check succeeds" to "New check succeeds", manual work is required to regularly check master for some time, fixing it with the new checks as old PR's get merged. The goal is for the base branch to succeeding with the new checks 100% of the time, so that any PR's based on it will also be using the new check.
Expand for a part that's not necessary However, writing down that last point, it's a bit too manual imo. Maybe that can be automated by: - Running this CI check not only for PRs, but also for the master branch. If the new stricter check fails (while the old one succeeds), automatically create an issue that pings me. I'll see this and create a PR that fixes it. - After such PR's are merged, and once we've gone say a week without any new automatic issues created, it means that no PR's breaking the new checks have been merged for a week. - At this point we can get rid of the old check and go back to the simpler single check scenario

@infinisil
Copy link
Member Author

infinisil commented Sep 27, 2023

I'm conflating two related issues here though:

  • CI not triggering again as stricter tooling is introduced on the base branch, therefore breaking master if they're merged
    • This is ideally fixed by just re-triggering that CI check for all old PRs.
    • The above approach may be more suitable if it's too expensive to do that however.
  • The nixpkgs-check-by-name tooling being used from a channel, so updates to it are delayed
    • Only this seems to require an approach as above, but only with a single PR to fix the new checks for the base branch, and only once the channel updates.
    • At that point, a CI re-trigger should also be done.

I think I'll have to write this down again..

infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
@infinisil
Copy link
Member Author

An independent initial step towards this: #257735

infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Sep 27, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
infinisil added a commit to tweag/nixpkgs that referenced this issue Oct 2, 2023
Previously, even if the check also failed on the base branch, it looked
like the PR introduced the failure.

We can easily have a better error message for such cases.

Meanwhile this also paves the road for something like
NixOS#256788
@infinisil
Copy link
Member Author

infinisil commented Oct 12, 2023

Alternate approach is to add --version <VERSION> to the tool. This way there's no need to have a pinned version of the tool, we can always use the latest one, but just run it twice (once with the previous version, once with the new one) to achieve the same effect

I'm adding such a --version in #256792 now

@infinisil
Copy link
Member Author

infinisil commented Oct 13, 2023

This doesn't really work how I imagined it after all, because after testing this in a separate repo I discovered that it's impossible to re-trigger workflows for older PR's with an updated workflow file, GitHub always uses the same workflow file. So all of these "change the workflow file to run the tool twice" don't really work.

Instead my new plan is to:

  • Pin the CI tooling to a channel release in the base branch
  • Update the workflow to call the pinned tooling from the base branch (this workflow hopefully never needs to be updated again..)
  • Whenever it's time to use a new stricter tool, update the pin and fix the base branch in a single PR
  • When the pinned version in the base branch updates, automatically re-trigger the pkgs/by-name check for all open PRs that change pkgs/top-level/all-packages.nix or pkgs/by-name.

This means that updates to nixpkgs-check-by-name do not automatically end up being used by CI. Instead the pinned version in the base branch determines the version, and that can only update once the tooling is built by a channel.

@infinisil
Copy link
Member Author

Okay change of plans again, the above is way too messy and doesn't catch everything. Instead I'm thinking now of going for this:

  • Pass both the base branch version and the merged PR version of Nixpkgs to nixpkgs-check-by-name
  • The tool fails only if the PR violates any checks that passed on the base branch.

This way, there will be a gradual migration for every new check, because all new (and updated) PRs won't be able to make it worse.

@thufschmitt
Copy link
Member

I don't know whether it's a viable solution, but I would advocate for considering using some kind of merge train (either GitHub's native merge queues or one of the many bots that do that). It looks like it's the “obvious” solution to the problem of having PRs pass the CI but break master once they are merged.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-10-31-nixpkgs-architecture-team-meeting-45/34846/1

@infinisil
Copy link
Member Author

I now opened #272395 where I use the approach from #256788 (comment). The changes from #261939 made this fairly straightforward :)

@infinisil
Copy link
Member Author

I now consider this done with #272395!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants