-
-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve CI permissions
, auto-merge maintainability, and clarity
#1668
Conversation
In GitHub Actions workflow files, this separates top-level keys within a job with blank lines, which was sometimes but not usually already done, since this seems to improve readability. Besides that, this also applies the prevailing style more consistently. This style tweak is in preparation for adding explicit `permissions` keys (so that doing so won't decrease readability).
This adds `permissions` keys at workflow or job level in a number of places, usually specifying minimal required permissions. When a `permissions` key contains at least one subkey setting some permission (in any way), all permissions it does not have subkeys for are treated as if they were present with a value of `none`. This relies on that rather than listing all unneeded permissions everywhere with `none`. So most `permissions` added here have only `contents: read`, and: - The only place where `none` is specified explicitly is in the CIFuzz workflow, where no permissions (of those controllable through `permissions`) are needed. - The only place any `write` permissions are specified are `contents: write` in the jobs of the release workflow that need it. All jobs involved in preparing a release currently have at least one step that requires this. But `contents: read` is still applied at the workflow level, because the `installation` job (which tests installing) does not need any `write` permissions. Note that some jobs that don't have any write permissions of the kind that is controlled under a `permissions` key do still perform writes to data accessible outside of their run: caching (including creating new caches), and uploading artifacts, still works. Relevant documentation: - https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#permissions - https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idpermissions - https://docs.github.com/en/rest/authentication/permissions-required-for-github-apps
This makes more stylistic adjustments to YAML workflows files. (Previous recent adjustments were, for the most part, only those needed to ensure clarity would be preserved while adding `permissions:`. This commit does some others that weren't needed for that.) The changes made here are: - Use `actions/checkout@v4`, not `actions/checkout@master`, in the one place where `@master` was still used for it. This is not strictly speaking a purely stylistic change, since it is eventually expected to make a difference, by holding onto major version 4 until the version is updated (which Depedabot automates). - Don't explicitly set the `CI` environment variable in the `test` job. This variable is guaranteed to be set automatically with the value of `true` by the GitHub Actions runner. Furthermore, no environment variables to indicate CI are explicitly set for other jobs, even though others rely on their presence as well due to the use of `is_ci` through `gix-testtools`. - Specify all environment variables' values as strings. Most values in YAML are strings without explicit quoting. This does not add quotation marks in those cases. But some values are parsed as some other type, such as integer, unless quoted. That makes it less obvious what the value will actually be in the environment, since it will be implicitly converted to a string, which does not always consist of the same sequence of characters as the original expression. This effect is most prominent for booleans (e.g. unquoted `yes` and `true` in YAML have the same meaning) but not entirely limited to them. In addition, such expressions may also lead to confusion if it is misread to mean that they will retain any kind of type information. So this quotes such values (except when other changes here remove them). - Minor quoting style adjustments, for YAML strings are quoted. Omit quotes when clearly not needed, and use single-quotes by default when quotes are needed. - Use 2-space indent inside scripts in script steps for consistency with other scripts. Most existing multi-line script steps, as well as all shell scripts, use 2-space indents already. - Remove `\` in a script step where it was not needed for line continuation because it followed a `|`. - In the `wasm` job's script steps, put `name:` and, when present, `if:`, ahead of `run:`, so it's clear what each step is for. (This was already done where applicable in other jobs.) - In the `wasm` job's script steps, split single-line `run:` values with `&&` into separate commands, one per line, where doing so has the same effect due to the `-e` option the CI runner automatically passes to `bash` shells unless `shell:` is overridden with a value that contains `{0}`. - In the `wasm` job's script steps, split single-line `run:` values with `;` into separate lines, including running loops over multiple lines, for readability. - In the `wasm` job, extract `${{ matrix.target }}` to an environment variable. This seems to make the steps slightly more readable, since `wasm` steps make heavy use of it. - Extremely minor adjustment to array style, for consistency. - In commands, use `--` where guaranteed to be supported if any non-option argument begins with parameter expansion or a glob. This was already almost, but not quite, already done. Since the possible values can be inferred from nearby code and do not begin with `-`, the reason is to clearly communicate that they are non-option arguments, rather than to fix any actual bug. (This continues to be avoided where is not guaranteed correct.) - Use the `-e` option before a pattern argument to `grep` formed through parameter expansion (same rationale as `--` above). - Use `-E` rather than `-r` with `sed`, since `-E` is standardized.
In the past (GitoxideLabs#143), Dependabot version updates had been used to keep `cargo` dependencies up to date. This was removed in favor of doing manual updates based on automatic reports from `cargo deny` and the old `dependabot.yml` was kept but renamed to disable it and point people to GitoxideLabs#144 to learn about the change. Since then, Dependabot security updates, which are distinct from Dependabot version updates, were enabled (see GitoxideLabs#1254), and later, Dependabot version updates were reintroduced for GitHub Actions only (GitoxideLabs#1357). At that point, there were two Dependabot-related YAML files: the old disabled one, and the new one for GHA. This removes the old one, explaining the situation in a comment in the new one, including a link to GitoxideLabs#144. While doing so, this also adjusts the YAML code style there, to bring it in line with the style of most other YAML files in the repository.
This gives `name:` keys to the `test-pass` steps, turning the first step's comment into its name. This way, the output can be as clear as the workflow file itself.
This intentionally fails right now, by omitting `tests-pass` itself as a job that `tests-pass` should not depend on, in order to ensure that it is able to fail. Once this is observed (and any other bugs fixed), this omission should be corrected, and then it should pass.
- Show the job lists for better debugging. - Reword the `diff` step name to be clearer (albeit less precise).
Changes: - Bump the version of `cargo-diet` used in the `lint` job from 1.2.4 to 1.2.7. - Use the version tag not just as an operand to `--tag`, but also as the ref from which the installation script itself is obtained. (The rationale is that the effect of skew here would probably be unintuitive, and also that we had been specifying `master` but the default branch of the `cargo-diet` repository is now `main`.) - Instead of using `|| true` on the step that runs `just check-size` (which needs cargo-diet), split installing and running into separate steps, and mark the running step `continue-on-error`.
The WebAssembly CI job definition had an unconditional job-level `continue-on-error` set to `true`, so that it would always report a successful conclusion, even on failure. This made sense before PR auto-merge was set up. But these jobs do not typically fail, and their failures should be more apparent, even if they should not yet block PRs from being auto-merged. This commit changes the `wasm` jobs so they are able to fail and cause the workflow as a whole to have a failing status, while still not blocking PR auto-merge. `continue-on-error` is removed from the job definition, but it is also no longer made a dependency of the required `tests-pass` check.
98f2c3d
to
5173e9a
Compare
I didn't expect that updating dependencies in I had totally missed that this could already be fixed easily, because I didn't realize that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this massive improvement! I can't wait to see it merged and adjust the auto-merge configuration to listen to tests-pass
instead of a long list of individual jobs.
Thanks a lot for the hint the additional reduction of risk. I assume that this repository setting is only doable if permissions are set correctly on a PR level, so I am glad this is possible now.
This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do shoud probably be considered separately.
I agree, let's merge this and have the WASM-dematrixification (if that's a word ;)) in a separate PR. There has been some interest in continuing the development of WASM-support lately so improving CI in that regard will certainly be helpful to them as well.
I didn't expect that updating dependencies in
Cargo.lock
(5173e9a) would be able to fix thecargo deny advisories
error about depending on the unmaintainedinstant
crate yet, but it looks like it has. Thanks!
No worries at all - at this time I didn't even read up to your assessment of the situation yet and just responded with my standard reflex in these situations 😁. Luckily, it worked.
You give me too much credit, because the improvements here are not needed for that, though I believe they do make it easier to keep it working properly. Edit: Yes, this was reported as done in #1551 (comment). #1551 did not make it possible for |
CI failed here on 5173e9a with a very strange error. It can be examined by expanding the annotations shown at the top of that page. The error reports that the GHA runner machine was out of disk space. Although it's possible for a new dependency to have a bug that causes massively increased disk space usage, I think that's probably not the cause, because the error does not happen for that same commit on my fork. My guess is that it might go away if the workflow is re-run, which I recommend trying. Edit: Also, the error is preceded by a warning about low disk space, and the warning seems to have been generated prior to any toolchains or dependencies being installed, so the changes in 5173e9a are unlikely to have triggered it. Further edit: Unfortunately that appears not to be the case. The error does happen again in 38edb2c (#1670), though again it happens only in the It also passes in my fork on a fork-internal |
The problem seems to happen only on the upstream repository, or maybe only on PRs from forks. This initially makes no changes, to verify that the problem does not simply happen regardless of content. See GitoxideLabs#1668, and in particular GitoxideLabs#1668 (comment), as well as #3, and #4, for details.
Oh, I see, thanks for the clarification. I was on an entirely different page there apparently. No space left on deviceFor posterity, here is the way the runner itself crashes:
So it looks like writing a trace to a log fails during normal operation, and that even before anything happens. The cache is pretty big, but no reason to fail this soon. On the current try, it gets to running the tests at least and I will keep watching it - maybe it works now and all this was no more than a bad (and transient) dream. |
It really does fail when trying to compile:
And then the log gets truncated so when looking at it again, it seems like not much happened. Let me try something…. |
That way, the runner hopefully doesn't run out of disk space while it is compiling the various targets used in a journey test.
Great catch--I had at some point wondered why nothing ever called The #1671 test PR here passes initially, and then fails after cherry-picking 5173e9a from here. So the increased space usage is triggered by upgrading the crates, somehow. Maybe this is due to dependencies being heavier, but my guess is that it is instead due to there being more stuff that has to be generated rather than obtained from the GitHub Actions cache. Alternatively, maybe the problem is that it retrieves data from the cache and saves them, and the size of those data, some of which can't be used, combined with the files that are generated, is too big. |
🎉 That worked! 🎉 |
Yes, I think it's that! Now the cache is basically useless, yet it consumes a lot of space. Alternatively, I could have cleared it, probably lovering the footprint significantly. Now I deleted all caches, 35GB of 10GB available, way above the limit, and think that it might now work with the extra clearing which can cut down the build time a little, presumably. Also, I think caches will become more inefficient over time as also Cargo doesn't delete old caches, so keeps various versions of the same crate around, each time it changes. At least…so I think, judging by the continuously growing crashes on my disk. |
These didn't make it into GitoxideLabs#1668. Besides comments, the changes are for consistency with the prevailing style, usually by omitting redundant YAML quoting. Removal of the outer double quotes in the `if` in `tests-pass` is a case of this, and produces an equivalent node in parsing (i.e. its equivalence does not depend on anything about GHA itself). But just to be sure, I did run yq '.jobs.tests-pass.steps[0].if' .github/workflows/ci.yml before and after the change, to ensure the output was the same. The other change here that deserves comment is the removal of `--` as an argument to a `diff` command. When any path argument is formed from paramter expansion or from a glob with a leading `*` or other globbing character, `--` helps express that the following arguments are not options. For `git diff`, a `--` expresses that the following arguments are neither options nor refs, but paths, so all `git diff` commands with paths in the CI workflows use `--` even if no shell expansions are involved. (In practice this means `--` is often useful for `diff` with paths and, based on this habit, I had inadvertently written a `--` where neither of the above scenarios applied. But that had actually decreased stylistic consistency because we are not using `--` elsewhere that the meaning of all arguments after it is unambiguous even without examining any surrounding context.)
Why is it useless? Is this because, since 132696d, the data that make it worthwhile are now deleted in the
Maybe that is why I was unable to reproduce the problem internally to my fork, even when testing with both
If the files where this happens are in the Another scheduled workflow could be added that clears all I wonder if the reason the cache was so large here in this upstream repository is that, separately from any recent changes, different branches (and different PRs) have their own caches to keep them from poisoning one another's caches, which would sometimes be a correctness problem and, for PRs from forks, a security problem. If I understand the documentation about this correctly, caches from It is possible for different jobs to share caches, and maybe there are some cases where that would make sense here. In particular, the |
Thanks for picking this up. I was blunt with words, and think "inefficient" would have been better. It's probably never useless, it is just reduced in value to the point where it might cost more to maintain the cache than the time it saves.
While looking up the docs I did realize that I knew nothing at all about how it works - it's much more adapted to the task than I thought. Here is an excerpt:
If that environment variable actually carries over, then the worst offender of huge
Making Windows jobs faster would of course be great, but I wouldn't push it if |
I certainly have no objection to characterizing caching as useless when its goal is to decrease overall load or running time but instead it increases it. I would also call that useless. (The software used to do the caching can of course still be useful in many, even most, other cases, where it does provide savings.) So it didn't seem to me that you were too blunt at all. Instead, I wanted to understand the source of the inefficiency, as well as figure out if there are any current uses of the As it currently stands (with 132696d), the benefit of caching build artifacts in the CI
This is to say that we cache artifacts for If the foregoing analysis is correct, then we should try out at least one of these changes in the hope that it will improve running time:
The second approach is elegant in that it splits the jobs where they are effectively almost being split by the running of But I think it's worth trying, so I'll open a PR for it. The running time of the CI
Fortunately, it does.
Is this due to the effect of builds being nondeterministic, or something else?
In that case, I will hold off, for now, on attempting to cause different jobs to share caches.
They are generous in the sense that new cache entries are created, and in that old entries are sometimes dropped gradually rather than immediately. But my understanding of the documentation is that, while it's fine for the cache to grow bigger than 10 GB from time to time, this could significantly worsen performance when it happens frequently:
|
I think on top of that is the issue of having so many feature toggles, which make reuse of workspace dependencies (i.e.
That's exciting, let's try that!
It's deterministic, at least that's a goal, but what I am suggesting is that each code change will cause a new file to be created. The build output changes, and so does the hashed portion of the filename of the build artefact. But I'd take my 'ideas' with a grain of salt as it's really just that, an idea on how this works based on using it. There is certainly more to it, and I have seen cache-trashing happening when switching between compiler versions in the same project, or when changing environment variables, which is something I'd expect not to be present if that hashing-story was entirely true.
Even that didn't happen here. Or if it happened, it happened too slowly. After all it seemed easy to generated caches larger than 10GB. After deleting the cache a couple of days ago entirely, it's now approaching 10GB already and I didn't work on |
In the `justfile`, this renames the old `ci-test` recipe to `ci-test-full`, and has `ci-test` no longer clean the target directory nor run journey tests. The CI `test` job thus remains the same, but it does moderately less work. A new CI job, `test-journey`, is introduced to run the journey tests (still via `ci-test-journey` recipe). This change is intended to allow greater parallelism, and possibly make caches work better. The CI `test` job has sometimes been a few minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See comments in GitoxideLabs#1668 for some discussion on this change.
In the `justfile`, this renames the old `ci-test` recipe to `ci-test-full`, and has `ci-test` no longer clean the target directory nor run journey tests. The CI `test` job thus remains the same, but it does moderately less work. A new CI job, `test-journey`, is introduced to run the journey tests (still via `ci-test-journey` recipe). This change is intended to allow greater parallelism, and possibly make caches work better. The CI `test` job has sometimes been a few minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See comments in GitoxideLabs#1668 for some discussion on this change.
In the `justfile`, this renames the old `ci-test` recipe to `ci-test-full`, and has `ci-test` no longer clean the target directory nor run journey tests. The CI `test` job thus remains the same, but it does moderately less work. A new CI job, `test-journey`, is introduced to run the journey tests (still via `ci-test-journey` recipe). This change is intended to allow greater parallelism, and possibly make caches work better. The CI `test` job has sometimes been a few minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See comments in GitoxideLabs#1668 for some discussion on this change.
Another setting shown in #1668 (review) that can probably be tightened is:
Currently we have no workflows that create or review pull requests. Furthermore, I believe this does not need to be enabled to allow Dependabot to open security updates and version updates, even if the setting to run Dependabot on GitHub Actions runners is enabled, so long as one is not doing something like implementing automatic merging of passing Dependabot pull requests, which we are not (and which I think we are unlikely to want here). I don't think it's really a security issue, at least with workflows that trigger on the events these do, to have this enabled. But it can always be easily reenabled if GHA workflows that create or approve PRs are ever added. |
Thanks for the hint - it's disabled now. |
This makes several changes to GitHub Actions workflows, which would create merge conflicts if submitted separately, and which even aside from that seemed, to varying degrees, like they would make more sense to do together than separately. But the commits are divided in such a way that breaking this up into a sequence of a few separate PRs would be easy to do, if preferred. There is also an associated recommendation for a change in the repository's settings.
The changes in this PR are as follows:
Use read-only
github.token
everywhere feasibleThis PR sets
permissions:
explicitly in each workflow and, where different jobs don't all need the same permissions, in individual jobs as well. The permissions are only what each job needs and, in particular,write
permissions are given only to the jobs that really require it (the jobs that create or modify releases in the release workflow).This change is not needed to safeguard against threats from malicious pull requests, and has no effect on such risks. Its primary significance to security is that it slightly decreases the risk from supply-chain attacks. Many GitHub Actions jobs run code from numerous transitive dependencies and other sources. (I say "slightly" because it does not affect the risk when running commands locally, when creating releases, or when using releases.) Secondarily, this may also decrease risks from unintentional vulnerabilities or other bugs in dependencies or in workflows themselves.
The reason this is unrelated to threats from malicious PRs is that, when PRs that originate from forks trigger the
pull_request
event, jobs in workflows triggered by that event already automatically get a read-onlygithub.token
. (This repository currently does not define any workflows that trigger on thepull_request_target
orworkflow_run
events, which would have additional security implications.)Specific details of this change are given in f41a58c.
I recommend also setting the default permissions of GitHub tokens in GitHub Actions jobs to read-only, by going to Settings → Actions → General and, under Workflow permissions, changing it from "Read and write permissions" to "Read repository contents and package permissions". That way, if a job is ever introduced with no explicit
permissions
in its definition or the containing workflow, the token will be read-only rather than read-write. That can be done at any time. Doing so without making any other changes will break the release workflow until the workflow is modified to specify the necessary write permissions (a change that this PR includes).Style workflows more consistently and slightly more spaciously
This applies the prevailing style everywhere, except that it puts blank lines between the top-level nodes of a job. (This does not change the spacing between steps, but rather between, for example,
steps
itself and other things at the same level.)It also makes some other improvements, such as replacing the last
actions/checkout@master
withactions/checkout@v4
, quotingenv
values when the absence of quotes makes them boolean or numeric in YAML (since they are still turned into strings to be placed in the environment, and the string representations are not always obvious), and removing an unnecessaryenv
key ofCI
, which is guaranteed to be set totrue
by the runner.Some of these improvements are directly motivated by the goal of preventing the addition of explicit
permissions
from making workflows and job definitions harder to read. Those are in 7eeeee2. In view of those, it seemed like a good idea to include the others, which otherwise would conflict with this PR or might conflict with other subsequent PRs, and which are conceptually along the same lines, but because they are not needed to allowpermissions
to be added without confusion, they are done in a separate commit, 44ff412.Bump cargo-diet version and express the logic more cleanly
87670a6 bumps cargo-diet to the latest stable version and makes some improvements to the steps, including splitting it into separate installation and running steps and making the running step
continue-on-error
, instead of writing|| true
after the command that (through ajust
rule) runscargo diet
.Provide richer log output for
tests-pass
20794ff adapts a comment in
tests-pass
to be a step name, so everything is clear in logs even if the reader is unfamiliar with the workflow.Detect new jobs that may be intended to block auto-merge
This only affects the
ci.yml
workflow, where thetests-pass
job stands in for all the jobs that are meant to block PR auto-merge, by being set up as a required job and depending on the others (#1551). It intentionally does not depend on all jobs in the workflow: of course it does not depend on itself, and it also does not depend ontest-fixtures-windows
(#1657, #1663). Another change in this PR, described below, also causes it no longer to depend onwasm
.That existing situation unfortunately makes it easy to add a job that it doesn't depend on, by simply forgetting to add it or forgetting to make a decision on whether the new job should block PR auto-merge.
The solution is to maintain two lists of jobs that are checked to ensure are disjoint and form, as their union, all jobs defined in
ci.yml
. A job to check this is added in 4e672ef, 7b7a819, and eba50ae.Make the
wasm
jobs fail the workflow but not block auto-mergePrior to the changes in this PR, the WebAssembly jobs would report a successful conclusion even if they failed, by using the
continue-on-error
key at the job level. This was needed to keep them from blocking auto-merge, but only becausewasm
was listed as a dependency oftests-pass
. This changes it so that it will report a failing conclusion (and thus fail the workflow) but not prevent PR auto-merge from working.Since these jobs are expected to pass--if they fail then it indicates a regression in WASM compatibility, and they rarely fail--this seems like a more suitable approach. See 253b716 for full details.
More broadly, although I do not propose adopting any rule about it, I think we should be reluctant to use
continue-on-error
at the job level inci.yml
, ever since branch protection rules and required checks for auto-merge have been set up and (since #1551) made partially configurable in the workflow file:push
orpull_request
whose failure doesn't represent something important, since such a job should usually either not exist or not run automatically.continue-on-error
at the job level and not, for example, in a specific step of a job that has at least one step not markedcontinue-on-error
.But I do not know of any way for one job to depend on specific matrix jobs while not depending on others. This is inapplicable to the
wasm
job definition because, although it uses a matrix strategy, the current intent is that neither job generated from it block PRs. But it does apply to thecargo deny
job definition, whoseadvisories
job is markedcontinue-on-error
, while itsbans licenses sources
job is not.This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do shoud probably be considered separately. There are some interesting subtleties, such as in the effect on when commands like
@dependabot merge
will go ahead with a merge (relevant because Dependabot security updates are created for all ecosystems, even thoughgithub-actions
is the only ecosystem that gets Dependabot version updates).That
advisories
job is currently failing--both here and if it is re-run onmain
--due to RUSTSEC-2024-0384 (sebcrozet/instant#52), announcing thatinstant
is unmaintained andweb-time
should be used instead. I had initially hoped to fix that here, but it is not feasible to do so: we useinstant
only as a transitive dependency through other crates that are not part of this repository and whose dependencies should be updated to useweb-time
. That situation is unaffected by any changes in this PR.Update other GitHub-specific YAML configuration
e45ea25 removes the old Dependabot configuration file for the
cargo
ecosystem, and adds a comment to the current Dependabot configuration file to explain why that ecosystem is not included. The explanation only links to #144, but accounts for the full current situation (see #143, #144, #1254 (comment), and #1357).4564a64 removes the empty pull request template file, as discussed in #1667.
(These changes are done here rather than in a separate PR because I was already editing
dependabot.yml
for style, and also because, as detailed above, I had originally planned to include other changes that might affect Dependbot indirectly, but that are deferred so they can be reviewed separately.)