
Improve CI permissions, auto-merge maintainability, and clarity #1668

Merged
merged 13 commits on Nov 12, 2024

Conversation

EliahKagan
Member

@EliahKagan EliahKagan commented Nov 11, 2024

This makes several changes to GitHub Actions workflows. Submitting them separately would create merge conflicts, and even aside from that, they seemed (to varying degrees) to make more sense together than apart. Still, the commits are divided in such a way that breaking this up into a sequence of a few separate PRs would be easy to do, if preferred. There is also an associated recommendation for a change in the repository's settings.

The changes in this PR are as follows:

Use read-only github.token everywhere feasible

This PR sets permissions: explicitly in each workflow and, where different jobs don't all need the same permissions, in individual jobs as well. The permissions are only what each job needs and, in particular, write permissions are given only to the jobs that really require them (the jobs that create or modify releases in the release workflow).
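
To illustrate the pattern with a minimal sketch (the job names and steps below are placeholders, not the actual jobs in these workflows): a read-only default is declared at the workflow level, and only a job that genuinely needs to write gets a broader grant.

    permissions:
      contents: read   # read-only github.token for every job by default

    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: cargo test --workspace

      publish-release:
        permissions:
          contents: write   # only this job creates or modifies releases
        runs-on: ubuntu-latest
        steps:
          - run: echo "create or update the GitHub release here"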

This change is not needed to safeguard against threats from malicious pull requests, and has no effect on such risks. Its primary significance to security is that it slightly decreases the risk from supply-chain attacks: many GitHub Actions jobs run code from numerous transitive dependencies and other sources. (I say "slightly" because it does not affect the risk when running commands locally, when creating releases, or when using releases.) Secondarily, this may also decrease risks from unintentional vulnerabilities or other bugs in dependencies or in workflows themselves.

The reason this is unrelated to threats from malicious PRs is that, when PRs that originate from forks trigger the pull_request event, jobs in workflows triggered by that event already automatically get a read-only github.token. (This repository currently does not define any workflows that trigger on the pull_request_target or workflow_run events, which would have additional security implications.)

Specific details of this change are given in f41a58c.

I recommend also setting the default permissions of GitHub tokens in GitHub Actions jobs to read-only, by going to Settings → Actions → General and, under Workflow permissions, changing it from "Read and write permissions" to "Read repository contents and packages permissions". That way, if a job is ever introduced with no explicit permissions in its definition or the containing workflow, the token will be read-only rather than read-write. That can be done at any time. Doing so without making any other changes will break the release workflow until the workflow is modified to specify the necessary write permissions (a change that this PR includes).

Style workflows more consistently and slightly more spaciously

This applies the prevailing style everywhere, with one addition: blank lines between the top-level nodes of a job. (This does not change the spacing between steps, but rather the spacing between, for example, steps itself and other keys at the same level.)

It also makes some other improvements, such as replacing the last actions/checkout@master with actions/checkout@v4, quoting env values when the absence of quotes makes them boolean or numeric in YAML (since they are still turned into strings to be placed in the environment, and the string representations are not always obvious), and removing an unnecessary env key of CI, which is guaranteed to be set to true by the runner.
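
As a hypothetical illustration of the env quoting point (the variable names are made up, not taken from the workflows):

    env:
      SOME_FLAG: true         # YAML boolean; the environment receives the string "true"
      SOME_VERSION: 1.10      # YAML float; the environment receives "1.1", not "1.10"
      OTHER_FLAG: 'true'      # string; the environment receives exactly "true"
      OTHER_VERSION: '1.10'   # string; the environment receives exactly "1.10"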

Some of these improvements are directly motivated by the goal of preventing the addition of explicit permissions from making workflows and job definitions harder to read; those are in 7eeeee2. In view of those, it seemed like a good idea to include the others as well: they would otherwise conflict with this PR or with subsequent PRs, and they are conceptually along the same lines. But because they are not needed to allow permissions to be added without confusion, they are done in a separate commit, 44ff412.

Bump cargo-diet version and express the logic more cleanly

87670a6 bumps cargo-diet to the latest stable version and makes some improvements to the steps, including splitting the single step into separate installation and running steps and making the running step continue-on-error, instead of writing || true after the command that (through a just rule) runs cargo diet.
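
Roughly, the resulting shape of those steps is as follows. This is only a sketch: the actual commit obtains cargo-diet through the project's installation script rather than cargo install, and the step names are approximations.

    - name: Install cargo-diet
      run: cargo install cargo-diet --version 1.2.7

    - name: Check crate sizes with cargo-diet
      continue-on-error: true
      run: just check-size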

Provide richer log output for tests-pass

20794ff adapts a comment in tests-pass to be a step name, so everything is clear in logs even if the reader is unfamiliar with the workflow.

Detect new jobs that may be intended to block auto-merge

This only affects the ci.yml workflow, where the tests-pass job stands in for all the jobs that are meant to block PR auto-merge, by being set up as a required job and depending on the others (#1551). It intentionally does not depend on all jobs in the workflow: of course it does not depend on itself, and it also does not depend on test-fixtures-windows (#1657, #1663). Another change in this PR, described below, also causes it no longer to depend on wasm.

That existing situation unfortunately makes it easy to add a job that tests-pass doesn't depend on, simply by forgetting to list the new job as a dependency or forgetting to decide whether it should block PR auto-merge.

The solution is to maintain two lists of jobs, which are checked to ensure that they are disjoint and that their union is exactly the set of jobs defined in ci.yml. A job to perform this check is added in 4e672ef, 7b7a819, and eba50ae.
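
One way such a check can be expressed is sketched below; this illustrates the idea rather than reproducing the code from those commits, and the job names and file names are placeholders.

    check-blocking:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4

        - name: List all jobs defined in ci.yml
          run: yq '.jobs | keys | .[]' .github/workflows/ci.yml | sort > all-jobs.txt

        - name: List the jobs we have explicitly classified
          run: |
            # Two manually maintained lists: jobs that tests-pass depends on,
            # and jobs deliberately excluded from it (placeholder contents).
            printf '%s\n' lint test cargo-deny > blocking.txt
            printf '%s\n' tests-pass test-fixtures-windows wasm > not-blocking.txt
            sort blocking.txt not-blocking.txt > classified.txt

        - name: Fail if the lists overlap or omit a job
          run: diff all-jobs.txt classified.txt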

Make the wasm jobs fail the workflow but not block auto-merge

Prior to the changes in this PR, the WebAssembly jobs would report a successful conclusion even if they failed, by using the continue-on-error key at the job level. This was needed to keep them from blocking auto-merge, but only because wasm was listed as a dependency of tests-pass. This PR changes that so a failing wasm job reports a failing conclusion (and thus fails the workflow) but does not prevent PR auto-merge from working.

Since these jobs are expected to pass--if they fail then it indicates a regression in WASM compatibility, and they rarely fail--this seems like a more suitable approach. See 253b716 for full details.

More broadly, although I do not propose adopting any rule about it, I think we should be reluctant to use continue-on-error at the job level in ci.yml, now that branch protection rules and required checks for auto-merge have been set up and (since #1551) made partially configurable in the workflow file:

  • Having jobs that unambiguously fail but do not block automatic merging usually seems granular enough to capture the situation where a failure represents something important that would nonetheless not be out of the ordinary to dismiss.
  • Other than to represent something that is temporarily broken but intended to be fixed immediately, it would be strange to have a job triggered by push or pull_request whose failure doesn't represent something important, since such a job should usually either not exist or not run automatically.
  • This reasoning only applies to continue-on-error at the job level and not, for example, in a specific step of a job that has at least one step not marked continue-on-error.

But I do not know of any way for one job to depend on specific matrix jobs while not depending on others. This is inapplicable to the wasm job definition because, although it uses a matrix strategy, the current intent is that neither job generated from it block PRs. But it does apply to the cargo deny job definition, whose advisories job is marked continue-on-error, while its bans licenses sources job is not.

This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do so should probably be considered separately. There are some interesting subtleties, such as in the effect on when commands like @dependabot merge will go ahead with a merge (relevant because Dependabot security updates are created for all ecosystems, even though github-actions is the only ecosystem that gets Dependabot version updates).
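
For concreteness, the proposed split could look roughly like the sketch below. The action version, job names, and step layout here are assumptions for illustration only, and whether to make this change at all is the separate question mentioned above.

    # Fails the workflow on a new advisory, but would not be listed in the
    # needs of tests-pass, so it would not block auto-merge.
    cargo-deny-advisories:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: EmbarkStudios/cargo-deny-action@v2
          with:
            command: check advisories

    # Remains a dependency of tests-pass and therefore blocks auto-merge.
    cargo-deny:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: EmbarkStudios/cargo-deny-action@v2
          with:
            command: check bans licenses sources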

That advisories job is currently failing--both here and if it is re-run on main--due to RUSTSEC-2024-0384 (sebcrozet/instant#52), announcing that instant is unmaintained and web-time should be used instead. I had initially hoped to fix that here, but it is not feasible to do so: we use instant only as a transitive dependency through other crates that are not part of this repository and whose dependencies should be updated to use web-time. That situation is unaffected by any changes in this PR.

Update other GitHub-specific YAML configuration

e45ea25 removes the old Dependabot configuration file for the cargo ecosystem, and adds a comment to the current Dependabot configuration file to explain why that ecosystem is not included. The explanation only links to #144, but accounts for the full current situation (see #143, #144, #1254 (comment), and #1357).

4564a64 removes the empty pull request template file, as discussed in #1667.

(These changes are done here rather than in a separate PR because I was already editing dependabot.yml for style, and also because, as detailed above, I had originally planned to include other changes that might affect Dependabot indirectly, but that are deferred so they can be reviewed separately.)

EliahKagan and others added 12 commits November 10, 2024 16:11
In GitHub Actions workflow files, this separates top-level keys
within a job with blank lines, which was sometimes but not usually
already done, since this seems to improve readability. Besides
that, this also applies the prevailing style more consistently.

This style tweak is in preparation for adding explicit
`permissions` keys (so that doing so won't decrease readability).
This adds `permissions` keys at workflow or job level in a number
of places, usually specifying minimal required permissions.

When a `permissions` key contains at least one subkey setting some
permission (in any way), all permissions it does not have subkeys
for are treated as if they were present with a value of `none`.
This relies on that rather than listing all unneeded permissions
everywhere with `none`.

So most `permissions` added here have only `contents: read`, and:

- The only place where `none` is specified explicitly is in the
  CIFuzz workflow, where no permissions (of those controllable
  through `permissions`) are needed.

- The only place any `write` permissions are specified is
  `contents: write` in the jobs of the release workflow that need
  it. All jobs involved in preparing a release currently have at
  least one step that requires this. But `contents: read` is still
  applied at the workflow level, because the `installation` job
  (which tests installing) does not need any `write` permissions.

Note that some jobs that don't have any write permissions of the
kind that is controlled under a `permissions` key do still perform
writes to data accessible outside of their run: caching (including
creating new caches), and uploading artifacts, still works.

Relevant documentation:

- https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#permissions
- https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idpermissions
- https://docs.github.com/en/rest/authentication/permissions-required-for-github-apps
This makes more stylistic adjustments to YAML workflow files.

(Previous recent adjustments were, for the most part, only those
needed to ensure clarity would be preserved while adding
`permissions:`. This commit does some others that weren't needed
for that.)

The changes made here are:

- Use `actions/checkout@v4`, not `actions/checkout@master`, in the
  one place where `@master` was still used for it. This is not
  strictly speaking a purely stylistic change, since it is
  eventually expected to make a difference, by holding onto major
  version 4 until the version is updated (which Dependabot
  automates).

- Don't explicitly set the `CI` environment variable in the `test`
  job. This variable is guaranteed to be set automatically with the
  value of `true` by the GitHub Actions runner. Furthermore, no
  environment variables to indicate CI are explicitly set for other
  jobs, even though others rely on their presence as well due to
  the use of `is_ci` through `gix-testtools`.

- Specify all environment variables' values as strings. Most
  values in YAML are strings without explicit quoting. This does
  not add quotation marks in those cases. But some values are
  parsed as some other type, such as integer, unless quoted. That
  makes it less obvious what the value will actually be in the
  environment, since it will be implicitly converted to a string,
  which does not always consist of the same sequence of characters
  as the original expression. This effect is most prominent for
  booleans (e.g. unquoted `yes` and `true` in YAML have the same
  meaning) but not entirely limited to them. In addition, such
  expressions may also lead to confusion if they are misread to mean
  that they will retain any kind of type information. So this
  quotes such values (except when other changes here remove them).

- Minor adjustments to how YAML strings are quoted.
  Omit quotes when clearly not needed, and use single-quotes by
  default when quotes are needed.

- Use 2-space indent inside scripts in script steps for consistency
  with other scripts. Most existing multi-line script steps, as
  well as all shell scripts, use 2-space indents already.

- Remove `\` in a script step where it was not needed for line
  continuation because it followed a `|`.

- In the `wasm` job's script steps, put `name:` and, when present,
  `if:`, ahead of `run:`, so it's clear what each step is for.
  (This was already done where applicable in other jobs.)

- In the `wasm` job's script steps, split single-line `run:` values
  with `&&` into separate commands, one per line, where doing so
  has the same effect due to the `-e` option the CI runner
  automatically passes to `bash` shells unless `shell:` is
  overridden with a value that contains `{0}`.

- In the `wasm` job's script steps, split single-line `run:`
  values with `;` into separate lines, including running loops over
  multiple lines, for readability.

- In the `wasm` job, extract `${{ matrix.target }}` to an
  environment variable. This seems to make the steps slightly more
  readable, since `wasm` steps make heavy use of it.

- Extremely minor adjustment to array style, for consistency.

- In commands, use `--` where guaranteed to be supported if any
  non-option argument begins with parameter expansion or a glob.
  This was already almost, but not quite, done. Since the
  possible values can be inferred from nearby code and do not
  begin with `-`, the reason is to clearly communicate that they
  are non-option arguments, rather than to fix any actual bug.
  (This continues to be avoided where it is not guaranteed correct.)

- Use the `-e` option before a pattern argument to `grep` formed
  through parameter expansion (same rationale as `--` above).

- Use `-E` rather than `-r` with `sed`, since `-E` is standardized.
In the past (GitoxideLabs#143), Dependabot version updates had been used to
keep `cargo` dependencies up to date. This was removed in favor of
doing manual updates based on automatic reports from `cargo deny`
and the old `dependabot.yml` was kept but renamed to disable it and
point people to GitoxideLabs#144 to learn about the change. Since then,
Dependabot security updates, which are distinct from Dependabot
version updates, were enabled (see GitoxideLabs#1254), and later, Dependabot
version updates were reintroduced for GitHub Actions only (GitoxideLabs#1357).
At that point, there were two Dependabot-related YAML files: the
old disabled one, and the new one for GHA.

This removes the old one, explaining the situation in a comment in
the new one, including a link to GitoxideLabs#144.

While doing so, this also adjusts the YAML code style there, to
bring it in line with the style of most other YAML files in the
repository.
This gives `name:` keys to the `tests-pass` steps, turning the
first step's comment into its name. This way, the output can be as
clear as the workflow file itself.
This intentionally fails right now, by omitting `tests-pass` itself
as a job that `tests-pass` should not depend on, in order to ensure
that it is able to fail. Once this is observed (and any other bugs
fixed), this omission should be corrected, and then it should pass.
- Show the job lists for better debugging.
- Reword the `diff` step name to be clearer (albeit less precise).
Changes:

- Bump the version of `cargo-diet` used in the `lint` job from
  1.2.4 to 1.2.7.

- Use the version tag not just as an operand to `--tag`, but also
  as the ref from which the installation script itself is obtained.
  (The rationale is that the effect of skew here would probably be
  unintuitive, and also that we had been specifying `master` but
  the default branch of the `cargo-diet` repository is now `main`.)

- Instead of using `|| true` on the step that runs
  `just check-size` (which needs cargo-diet), split installing
  and running into separate steps, and mark the running step
  `continue-on-error`.
The WebAssembly CI job definition had an unconditional job-level
`continue-on-error` set to `true`, so that it would always report a
successful conclusion, even on failure. This made sense before PR
auto-merge was set up. But these jobs do not typically fail, and
their failures should be more apparent, even if they should not
yet block PRs from being auto-merged.

This commit changes the `wasm` jobs so they are able to fail and
cause the workflow as a whole to have a failing status, while still
not blocking PR auto-merge. `continue-on-error` is removed from the
job definition, but it is also no longer made a dependency of the
required `tests-pass` check.
@Byron Byron force-pushed the run-ci/gha-permissions branch from 98f2c3d to 5173e9a on November 12, 2024 07:30
@EliahKagan
Member Author

I didn't expect that updating dependencies in Cargo.lock (5173e9a) would be able to fix the cargo deny advisories error about depending on the unmaintained instant crate yet, but it looks like it has. Thanks!

I had totally missed that this could already be fixed easily, because I didn't realize that fastrand had completely stopped depending on instant since 2.0.0 due to smol-rs/fastrand@eb78d89 (smol-rs/fastrand#60), and we had been depending on an earlier version than that. Sorry about my inaccurate claim that the dependency on instant was not possible to remove in this PR!

Member

@Byron Byron left a comment


Thanks so much for this massive improvement! I can't wait to see it merged and adjust the auto-merge configuration to listen to tests-pass instead of a long list of individual jobs.

[Screenshot: 2024-11-12 at 08:07:09]

Thanks a lot for the hint about the additional reduction of risk. I assume that this repository setting is only doable if permissions are set correctly on a PR level, so I am glad this is possible now.

This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do so should probably be considered separately.

I agree, let's merge this and have the WASM-dematrixification (if that's a word ;)) in a separate PR. There has been some interest in continuing the development of WASM-support lately so improving CI in that regard will certainly be helpful to them as well.

I didn't expect that updating dependencies in Cargo.lock (5173e9a) would be able to fix the cargo deny advisories error about depending on the unmaintained instant crate yet, but it looks like it has. Thanks!

No worries at all - at this time I didn't even read up to your assessment of the situation yet and just responded with my standard reflex in these situations 😁. Luckily, it worked.

(Review thread on .github/workflows/ci.yml — resolved.)
@Byron Byron enabled auto-merge November 12, 2024 07:49
@EliahKagan
Member Author

EliahKagan commented Nov 12, 2024

Thanks so much for this massive improvement! I can't wait to see it merged and adjust the auto-merge configuration to listen to tests-pass instead of a long list of individual jobs.

You give me too much credit, because the improvements here are not needed for that, though I believe they do make it easier to keep it working properly. tests-pass was introduced in #1551 and has been appropriate to mark as the solely required check in ci.yml since then. I am also pretty sure that you have already made it so, since none of the other checks show as required in PRs since around that time.

Edit: Yes, this was reported as done in #1551 (comment).

#1551 did not make it possible for tests-pass to substitute for listing checks from other workflows as required, but nothing in this PR extends it in that way, either. As far as I know, there is no change to this repository's required checks (or any other branch protection related settings) that should be done following the merge of this PR.

@EliahKagan
Member Author

EliahKagan commented Nov 12, 2024

CI failed here on 5173e9a with a very strange error. It can be examined by expanding the annotations shown at the top of that page. The error reports that the GHA runner machine was out of disk space. Although it's possible for a new dependency to have a bug that causes massively increased disk space usage, I think that's probably not the cause, because the error does not happen for that same commit on my fork. My guess is that it might go away if the workflow is re-run, which I recommend trying.

Edit: Also, the error is preceded by a warning about low disk space, and the warning seems to have been generated prior to any toolchains or dependencies being installed, so the changes in 5173e9a are unlikely to have triggered it.

Further edit: Unfortunately that appears not to be the case. The error does happen again in 38edb2c (#1670), though again it happens only in the pull_request run here, and not in the push run for the same commit in my fork. For this reason, I am concerned that it might be triggered by the changes here. It's appealing to think this might be due to a transient problem with the GitHub infrastructure, but there are no known recent incidents. Maybe it has something to do with caching in these workflows.

It also passes in my fork on a fork-internal pull_request trigger, both on a PR equivalent to this one (EliahKagan#3), and on a PR that is just 5173e9a cherry-picked (EliahKagan#4). I don't know why this is failing on the upstream repository but passing in my fork. However, I think it is actually failing later than it seems: I think the test step runs, and gets quite far through, but then when the disk is out of space, it is canceled and that step's log is not preserved. I've opened #1671 to see if a new PR here, with an empty commit and then with 5173e9a cherry-picked (comparable to EliahKagan#4), will fail here.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 12, 2024
The problem seems to happen only on the upstream repository, or
maybe only on PRs from forks. This initially makes no changes, to
verify that the problem does not simply happen regardless of
content.

See GitoxideLabs#1668, and in particular
GitoxideLabs#1668 (comment),
as well as #3, and #4, for details.
@Byron
Member

Byron commented Nov 12, 2024

You give me too much credit, because the improvements here are not needed for that, though I believe they do make it easier to keep it working properly. tests-pass was introduced in #1551 and has been appropriate to mark as the solely required check in ci.yml since then. I am also pretty sure that you have already made it so, since none of the other checks show as required in PRs since around that time.

Edit: Yes, this was reported as done in #1551 (comment).

#1551 did not make it possible for tests-pass to substitute for listing checks from other workflows as required, but nothing in this PR extends it in that way, either. As far as I know, there is no change to this repository's required checks (or any other branch protection related settings) that should be done following the merge of this PR.

Oh, I see, thanks for the clarification. I was on an entirely different page there apparently.

No space left on device

For posterity, here is the way the runner itself crashes:

    System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log'
       at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
       at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
       at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
       at System.Diagnostics.TextWriterTraceListener.Flush()
       at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
       at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
       at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
       at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
       at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
    System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log'
       at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
       at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
       at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
       at System.Diagnostics.TextWriterTraceListener.Flush()
       at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
       at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
       at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
       at GitHub.Runner.Common.Tracing.Error(Exception exception)
       at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
    Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log'
       at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
       at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
       at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
       at System.Diagnostics.TextWriterTraceListener.Flush()
       at System.Diagnostics.TraceSource.Flush()
       at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
       at GitHub.Runner.Common.TraceManager.Dispose()
       at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
       at GitHub.Runner.Common.HostContext.Dispose()
       at GitHub.Runner.Worker.Program.Main(String[] args)

So it looks like writing a trace to a log fails during normal operation, and that even before anything happens.

[Screenshot: 2024-11-12 at 11:21:13]

The cache is pretty big, but no reason to fail this soon.

On the current try, it gets to running the tests at least and I will keep watching it - maybe it works now and all this was no more than a bad (and transient) dream.

@Byron
Copy link
Member

Byron commented Nov 12, 2024

It really does fail when trying to compile:

   Compiling gix-protocol v0.46.0 (/home/runner/work/gitoxide/gitoxide/gix-protocol)
   Compiling gix-status v0.14.0 (/home/runner/work/gitoxide/gitoxide/gix-status)
   Compiling gix-merge v0.0.0 (/home/runner/work/gitoxide/gitoxide/gix-merge)
   Compiling gix-worktree-state v0.14.0 (/home/runner/work/gitoxide/gitoxide/gix-worktree-state)
   Compiling gix-negotiate v0.16.0 (/home/runner/work/gitoxide/gitoxide/gix-negotiate)
   Compiling gix-mailmap v0.25.0 (/home/runner/work/gitoxide/gitoxide/gix-mailmap)
   Compiling tracing-forest v0.1.6
   Compiling gix v0.67.0 (/home/runner/work/gitoxide/gitoxide/gix)
   Compiling gix-fsck v0.7.0 (/home/runner/work/gitoxide/gitoxide/gix-fsck)
   Compiling env_logger v0.10.2
   Compiling gitoxide-core v0.42.0 (/home/runner/work/gitoxide/gitoxide/gitoxide-core)
rustc-LLVM ERROR: IO failure on output stream: No space left on device
rustc-LLVM ERROR: IO failure on output stream: No space left on device
error: could not compile `gitoxide-core` (lib)
warning: build failed, waiting for other jobs to finish...
error: could not compile `gix` (lib)
error: Recipe `journey-tests-async` failed on line 220 with exit code 101
Error: Process completed with exit code 101.

And then the log gets truncated so when looking at it again, it seems like not much happened. Let me try something….

That way, the runner hopefully doesn't run out of disk space while
it is compiling the various targets used in a journey test.
@EliahKagan
Member Author

Great catch--I had at some point wondered why nothing ever called clear-target and I should've thought to ask!

The #1671 test PR here passes initially, and then fails after cherry-picking 5173e9a from here. So the increased space usage is triggered by upgrading the crates, somehow. Maybe this is due to dependencies being heavier, but my guess is that it is instead due to there being more stuff that has to be generated rather than obtained from the GitHub Actions cache. Alternatively, maybe the problem is that it retrieves data from the cache and saves them, and the size of those data, some of which can't be used, combined with the files that are generated, is too big.

Assuming 132696d fixes this, I'll close #1671.

@Byron Byron merged commit d47263b into GitoxideLabs:main Nov 12, 2024
18 checks passed
@EliahKagan
Member Author

EliahKagan commented Nov 12, 2024

🎉 That worked! 🎉

@EliahKagan EliahKagan deleted the run-ci/gha-permissions branch November 12, 2024 10:56
@Byron
Member

Byron commented Nov 12, 2024

The #1671 test PR here passes initially, and then fails after cherry-picking 5173e9a from here. So the increased space usage is triggered by upgrading the crates, somehow. Maybe this is due to dependencies being heavier, but my guess is that it is instead due to there being more stuff that has to be generated rather than obtained from the GitHub Actions cache. Alternatively, maybe the problem is that it retrieves data from the cache and saves them, and the size of those data, some of which can't be used, combined with the files that are generated, is too big.

Yes, I think it's that!

Now the cache is basically useless, yet it consumes a lot of space. Alternatively, I could have cleared it, probably lowering the footprint significantly.

Now I have deleted all caches (35GB where only 10GB are available, way above the limit) and think that it might now work with the extra clearing, which can presumably also cut down the build time a little.

Also, I think caches will become more inefficient over time, since Cargo also doesn't delete old build artifacts and so keeps various versions of the same crate around each time it changes. At least… so I think, judging by the continuously growing caches on my disk.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 12, 2024
These didn't make it into GitoxideLabs#1668.

Besides comments, the changes are for consistency with the
prevailing style, usually by omitting redundant YAML quoting.

Removal of the outer double quotes in the `if` in `tests-pass` is a
case of this, and produces an equivalent node in parsing (i.e. its
equivalence does not depend on anything about GHA itself). But just
to be sure, I did run

    yq '.jobs.tests-pass.steps[0].if' .github/workflows/ci.yml

before and after the change, to ensure the output was the same.

The other change here that deserves comment is the removal of `--`
as an argument to a `diff` command. When any path argument is
formed from parameter expansion or from a glob with a leading `*` or
other globbing character, `--` helps express that the following
arguments are not options. For `git diff`, a `--` expresses that
the following arguments are neither options nor refs, but paths, so
all `git diff` commands with paths in the CI workflows use `--`
even if no shell expansions are involved.

(In practice this means `--` is often useful for `diff` with paths
and, based on this habit, I had inadvertently written a `--` where
neither of the above scenarios applied. But that had actually
decreased stylistic consistency because we are not using `--`
elsewhere that the meaning of all arguments after it is unambiguous
even without examining any surrounding context.)
@EliahKagan
Member Author

EliahKagan commented Nov 12, 2024

Now the cache is basically useless, yet it consumes a lot of space.

Why is it useless? Is this because, since 132696d, the data that make it worthwhile are now deleted in the ci-test rule when it calls clear-target before calling journey-tests? If so, this would apply only to the test job, and not to any other jobs, right?

Alternatively, I could have cleared it, probably lowering the footprint significantly.

Maybe that is why I was unable to reproduce the problem internally to my fork, even when testing with both push and pull_request triggers: the GitHub Actions cache in my fork was less than 9 GiB large when I checked it a day or two before opening this PR, and furthermore I had also cleared it, to ensure that creating new cache entries did not require an actions: write or other write permission.

Also, I think caches will become more inefficient over time, since Cargo also doesn't delete old build artifacts and so keeps various versions of the same crate around each time it changes. At least… so I think, judging by the continuously growing caches on my disk.

If the files where this happens are in the target directory, then I think at least in principle the exclusion of files whose modification time is older than 2 weeks should prevent it from growing in this way. That assumes, of course, that the original mtimes metadata are set on files retrieved from the GHA cache. I have not checked to see if that is the case.

Another scheduled workflow could be added that clears all rust-cache-related caches every week or two. I can add that if you want to see if it makes a difference. But maybe it would be better to investigate why mtime checking is not already solving this problem.
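
If that route were taken, a minimal version of such a workflow might look like the sketch below. It is entirely hypothetical: the name, the schedule, and whether to delete only rust-cache-related entries rather than everything are all open choices.

    name: clear-caches

    on:
      schedule:
        - cron: '0 3 * * 1'   # weekly; placeholder schedule
      workflow_dispatch:

    permissions:
      actions: write   # required to delete caches
      contents: read

    jobs:
      clear:
        runs-on: ubuntu-latest
        steps:
          - name: Delete GitHub Actions caches
            env:
              GH_TOKEN: ${{ github.token }}
              GH_REPO: ${{ github.repository }}
            run: gh cache delete --all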

I wonder if the reason the cache was so large here in this upstream repository is that, separately from any recent changes, different branches (and different PRs) have their own caches to keep them from poisoning one another's caches, which would sometimes be a correctness problem and, for PRs from forks, a security problem. If I understand the documentation about this correctly, caches from main can be used by PRs, but when PRs save caches, those caches are not available in runs triggered by pushes to main, nor accessible in other PRs.

It is possible for different jobs to share caches, and maybe there are some cases where that would make sense here. In particular, the test-fixtures-windows job seems like it needn't cache independently from the test-fast windows-latest job. So long as the jobs with write permissions in the release workflow do not use caches from other jobs (currently they don't use caches at all, which seems fine to me, as they don't need to run especially fast), I think there should be no significant reduction in security from sharing caches between jobs. Whether that would make anything faster or make the total size of all caches smaller is another question; I am not sure. In addition, there is the problem that the rust-cache action seems like it might not be designed to easily facilitate sharing across separate jobs (Swatinem/rust-cache#205).
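
For reference, the rust-cache action does expose a shared-key input intended for sharing one cache across jobs, so a minimal (hypothetical, untested here) version of that sharing would be to use the same key in both Windows jobs; whether it actually helps, given the issue linked above, is unclear.

    # In both the test-fast (windows-latest) and test-fixtures-windows jobs:
    - uses: Swatinem/rust-cache@v2
      with:
        shared-key: windows-tests   # identical key in both jobs so they share one cache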

@Byron
Member

Byron commented Nov 13, 2024

Why is it useless? Is this because, since 132696d, the data that make it worthwhile are now deleted in the ci-test rule when it calls clear-target before calling journey-tests? If so, this would apply only to the test job, and not to any other jobs, right?

Thanks for picking this up. I was blunt with words, and think "inefficient" would have been better. It's probably never useless, it is just reduced in value to the point where it might cost more to maintain the cache than the time it saves.

If the files where this happens are in the target directory, then I think at least in principle the exclusion of files whose modification time is older than 2 weeks should prevent it from growing in this way. That assumes, of course, that the original mtimes metadata are set on files retrieved from the GHA cache. I have not checked to see if that is the case.

While looking up the docs I did realize that I knew nothing at all about how it works - it's much more adapted to the task than I thought. Here is an excerpt:

In particular, the workspace crates themselves are not cached since doing so is Swatinem/rust-cache#37 (comment). For this reason, this action automatically sets CARGO_INCREMENTAL=0 to disable incremental compilation, so that the Rust compiler doesn't waste time creating the additional artifacts required for incremental builds.

If that environment variable actually carries over, then the worst offender of huge target directories would be no more. However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.
If the 2-week rule holds, then at least there is a limit on how many of those duplicates it would pick up. This also means that dependencies that don't change a lot and remain stable (the case for the majority of the dependency graph) might fall out of the cache even though they are still very useful.

Whether that would make anything faster or make the total size of all caches smaller is another question; I am not sure. In addition, there is the problem that the rust-cache action seems like it might not be designed to easily facilitate sharing across separate jobs (Swatinem/rust-cache#205).

Making Windows jobs faster would of course be great, but I wouldn't push it if rust-cache isn't made for that.
Regarding the total size of caches, here it was 35GB of 10GB, so I think they are generous in enforcing such limits and I personally wouldn't care about cache-size reduction related to the overall amount stored by GitHub. I would care, however, about the size of the individual cache, but don't think there is any influence. A world where the cache only contains files that are actually used is probably never going to happen.

@EliahKagan
Member Author

EliahKagan commented Nov 13, 2024

I certainly have no objection to characterizing caching as useless when its goal is to decrease overall load or running time but instead it increases it. I would also call that useless. (The software used to do the caching can of course still be useful in many, even most, other cases, where it does provide savings.) So it didn't seem to me that you were too blunt at all. Instead, I wanted to understand the source of the inefficiency, as well as figure out if there are any current uses of the rust-cache action in this project's workflows that should be removed.

As it currently stands (with 132696d), the benefit of caching build artifacts in the CI test job seems fairly low, because we:

  1. Retrieve cached artifacts, which are those that support whatever build steps were run between the last cargo clean and writing the cache.
  2. Run check doc unit-tests.
  3. Run clear-target, which runs cargo clean.
  4. Run ci-journey-tests.
  5. Cache artifacts from builds done as part of ci-journey-tests, since others were cleaned.

This is to say that we cache artifacts for ci-journey-tests, but ci-journey-tests never sees those artifacts. Instead, check, doc, and unit-tests get those artifacts, and they may be able to use some of them, since we are talking about dependencies, but it is probably not nearly as good as if they got the artifacts produced from those steps. Then both the artifacts from ci-journey-tests that were retrieved from the cache, and other artifacts from check doc unit-tests, are removed, and ci-journey-tests has to start off with a fresh build.

If the foregoing analysis is correct, then we should try out at least one of these changes in the hope that it will improve running time:

  • Not use the GitHub Actions cache for the CI test job, or
  • Split the test job into two jobs, at the breaking point where we are running ci-journey-tests, by splitting out the running of journey tests to a new job. Then it would no longer be necessary to delete artifacts during the job (i.e., we would omit clear-target). (See the sketch just after this list.)
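
A rough sketch of that second option as a new job in ci.yml (job and recipe names are placeholders loosely following the naming above; toolchain and just setup steps are omitted):

    test-journey:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: Swatinem/rust-cache@v2
        - name: Run the journey tests only
          run: just ci-journey-tests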

The second approach is elegant in that it splits the jobs where they are effectively almost being split by the running of cargo clean. At first glance it seems like a perfect solution. But I am actually skeptical: it might turn out that, with what seems to be an increased size of build artifacts for dependencies ever since Cargo.lock was updated, this would--precisely because it would manage to cache everything we want to cache--cause the cache to grow too big.

But I think it's worth trying, so I'll open a PR for it. The running time of the CI test job has increased to about 20 minutes, and maybe this can speed things up. Even if not, it could help with that for reasons unrelated to caching, in that it would allow for more parallelism across runners.

If that environment variable actually carries over

Fortunately, it does.

However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.

Is this due to the effect of builds being nondeterministic, or something else?

Making Windows jobs faster would of course be great, but I wouldn't push it if rust-cache isn't made for that.

In that case, I will hold off, for now, on attempting to cause different jobs to share caches.

Regarding the total size of caches, here it was 35GB of 10GB, so I think they are generous in enforcing such limits and I personally wouldn't care about cache-size reduction related to the overall amount stored by GitHub.

They are generous in the sense that new cache entries are created, and in that old entries are sometimes dropped gradually rather than immediately. But my understanding of the documentation is that, while it's fine for the cache to grow bigger than 10 GB from time to time, this could significantly worsen performance when it happens frequently:

If you exceed the limit, GitHub will save the new cache but will begin evicting caches until the total size is less than the repository limit. The cache eviction process may cause cache thrashing, where caches are created and deleted at a high frequency. To reduce this, you can review the caches for a repository and take corrective steps, such as removing caching from specific workflows. For more information, see "Managing caches."

@Byron
Member

Byron commented Nov 14, 2024

This is to say that we cache artifacts for ci-journey-tests, but ci-journey-tests never sees those artifacts. Instead, check, doc, and unit-tests get those artifacts, and they may be able to use some of them, since we are talking about dependencies, but it is probably not nearly as good as if they got the artifacts produced from those steps. Then both the artifacts from ci-journey-tests that were retrieved from the cache, and other artifacts from check doc unit-tests, are removed, and ci-journey-tests has to start off with a fresh build.

I think on top of that is the issue of having so many feature toggles, which make reuse of workspace dependencies (i.e. gix* crates) unlikely between unit tests and journey tests. Ultimately, I'd think that the only effective portion of the cache is the one for external dependencies.

But I think it's worth trying, so I'll open a PR for it. The running time of the CI test job has increased to about 20 minutes, and maybe this can speed things up. Even if not, it could help with that for reasons unrelated to caching, in that it would allow for more parallelism across runners.

That's exciting, let's try that!

However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.

Is this due to the effect of builds being nondeterministic, or something else?

It's deterministic, at least that's a goal, but what I am suggesting is that each code change will cause a new file to be created. The build output changes, and so does the hashed portion of the filename of the build artefact.

But I'd take my 'ideas' with a grain of salt as it's really just that, an idea on how this works based on using it. There is certainly more to it, and I have seen cache-thrashing happening when switching between compiler versions in the same project, or when changing environment variables, which is something I'd expect not to be present if that hashing-story were entirely true.

They are generous in the sense that new cache entries are created, and in that old entries are sometimes dropped gradually rather than immediately. But my understanding of the documentation is that, while it's fine for the cache to grow bigger than 10 GB from time to time, this could significantly worsen performance when it happens frequently:

Even that didn't happen here. Or if it happened, it happened too slowly. After all, it seemed easy to generate caches larger than 10GB. After deleting the cache a couple of days ago entirely, it's now approaching 10GB already and I didn't work on gitoxide at all. I will keep watching.

@EliahKagan
Member Author

EliahKagan commented Nov 14, 2024

That's exciting, let's try that!

I hope to open a PR soon. In the mean time, I've opened the question #1673 to inquire about something that may help keep things clear in the changes in that forthcoming PR, but that I believe need not block that PR.

Edit: PR #1674

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 14, 2024
In the `justfile`, this renames the old `ci-test` recipe to
`ci-test-full`, and has `ci-test` no longer clean the target
directory nor run journey tests.

The CI `test` job thus remains the same, but it does moderately
less work. A new CI job, `test-journey`, is introduced to run the
journey tests (still via `ci-test-journey` recipe).

This change is intended to allow greater parallelism, and possibly
make caches work better. The CI `test` job has sometimes been a few
minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See
comments in GitoxideLabs#1668 for some discussion on this change.
@EliahKagan
Member Author

EliahKagan commented Nov 14, 2024

Another setting shown in #1668 (review) that can probably be tightened is:

Allow GitHub Actions to create and approve pull requests

Currently we have no workflows that create or review pull requests. Furthermore, I believe this does not need to be enabled to allow Dependabot to open security updates and version updates, even if the setting to run Dependabot on GitHub Actions runners is enabled, so long as one is not doing something like implementing automatic merging of passing Dependabot pull requests, which we are not (and which I think we are unlikely to want here).

I don't think it's really a security issue, at least with workflows that trigger on the events these do, to have this enabled. But it can always be easily reenabled if GHA workflows that create or approve PRs are ever added.

@Byron
Member

Byron commented Nov 14, 2024

Thanks for the hint - it's disabled now.

@EliahKagan EliahKagan mentioned this pull request Dec 18, 2024