Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC 0056] CI all Nix PRs #56

Closed
wants to merge 5 commits into from

Conversation

Ericson2314
Copy link
Member

@Ericson2314 Ericson2314 commented Oct 31, 2019

Build all Nix PRs in CI. Do not merge any PR until it passes CI.

This has been discussed for Nixpkgs, but it should be more economically feasible for just Nix.

Rendered

@Ericson2314 Ericson2314 changed the title CI all nix PRs CI all Nix PRs Oct 31, 2019
@Ericson2314
Copy link
Member Author

Exhibits:

  1. Switch to nixpkgs 19.09 nix#3165 I have no idea if this works on all platforms. Those mentioned failing tests are from a different branch. We shouldn't need to extrapolate from different code being tested.

  2. Port Nix to Meson nix#3160 I happened to test Darwin by hand and notice the failures. If I hadn't, would the original author had known? Would have anyone else?

@Mic92 Mic92 changed the title CI all Nix PRs [RFC 0056] CI all Nix PRs Oct 31, 2019
@sondr3
Copy link

sondr3 commented Oct 31, 2019

So something like the Bors bot used in a lot of the Rust projects? I'd also love to see something like @ofborg merge-if-successful where after fixing feedback the PR would be merged if everything builds fine, right now a lot of PRs also languish after the reviewers gives thumbs up but then never merge the PR. There are a couple of issues already about this in ofborg, e.g. NixOS/ofborg#104 and NixOS/ofborg#112.

@edolstra
Copy link
Member

We can't use Hydra for this for security reasons. Maybe ofborg could do it. Or we could run a separate Hydra instance that doesn't have write access to cache.nixos.org.

@sondr3 We definitely shouldn't auto-merge PRs just because they build correctly. Just because something builds doesn't mean it's a good idea.

@sondr3
Copy link

sondr3 commented Oct 31, 2019

@edolstra not all PRs no, but a lot of the contributed PRs are simply minor or patch version bumps and most of them should be fine to auto merge after building in my opinion. As mentioned, this has already been brought up in ofborg, and is something I think would be useful (though probably out of scope for this RFC).

@Ericson2314
Copy link
Member Author

@sondr3 Remember this about Nix not Nixpkgs. Baby steps :)

@edolstra What security issues are you worried about? It's not clear to me.

@edolstra
Copy link
Member

@Ericson2314 Anybody could submit a PR that DoSes the Hydra evaluator or queue, or exploits a zero-day kernel bug to root the build machines.

@Ericson2314
Copy link
Member Author

Aren't those already problems with hydra building arbitrary stuff in Nixpkgs? I see no security difference between submitting arbitrary C++ and arbitrary Nix that can fetch external C++.

Anybody could submit a PR that DoSes the Hydra evaluator or queue

We have timeouts and other resource limits already, right?

Exploits a zero-day kernel bug to root the build machines.

Say a package in Nixpkgs has such a bug, doesn't use it during the ofborg build except to get a side-channel to tell whether it's the cache.nixos.org build, and poisons the results of the hydra.nixos.org build.

Am I missing something?

@edolstra
Copy link
Member

Hydra doesn't build arbitrary stuff, it builds stuff that has been merged so presumably a human has looked at it. It's ofborg that's exposed, but it doesn't have write access to cache.nixos.org so it doesn't really matter.

We have timeouts and other resource limits already, right?

There are some limits, but there is nothing preventing jobset evaluation consuming all memory (we already do a good job of that ourselves unintentionally...), or a jobset creating a billion builds in the database, or a lots of jobs that fork-bomb every build machine...

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 31, 2019

Hydra doesn't build arbitrary stuff, it builds stuff that has been merged so presumably a human has looked at it.

I guarantee our reviewers will not reliably catch zero-days, full stop.

There are some limits, but...

Those DDOs problems are all fixable with not much effort. I wouldn't characterize them as a security issues because they do not compromise the store, and the availability of hydra is already....not great.

Given that those are low-impact problem, I would adopt this RFC as is, make some effort to limit resources better, and only if/when spam becomes a problem disable this policy until the issues are fixed.

@edolstra
Copy link
Member

and the availability of hydra is already....not great

If the availability is not great, that means it's not appropriate as a PR building tool (where you want quick feedback).

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 31, 2019

I mean that hydra goes down a fair amount; when it's up it works better for building PRs with Nix than travis, or circle-ci, etc, for Nix. Hercules or something would do better than Hydra, but that's a separate conversation.

Basically, I'm saying that unless there is a serious security problem or something, we should CI the we want to regardless of hydra's fragility, rather than shifting the goalposts to avoid doing something about Hydra being fragile.

@Mic92
Copy link
Member

Mic92 commented Oct 31, 2019

Well, if it is just Nix almost every CI would do. Hydra, Github Actions, Travis CI. A simple travis.yml would be shorter than this RFC.

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 31, 2019

@Mic92 Those are often timeout hell. Also, the YAML/whatever boilerplate for those is a huge maintenance burden because it is impossible to reproduce locally. Finally, I do want cached builds of PRs, not merely a CI green light. I think that's a great feature.

@Mic92
Copy link
Member

Mic92 commented Oct 31, 2019

@Ericson2314 This was the case for nixpkgs, but do you think they would timeout for nix as well? In nix's release.nix the nixpkgs 19.09 channel is used, so all dependencies should be build already unlike nixpkgs master. I extended the travis.yml to also build on linux: NixOS/nix#3187

Regarding local reproducibility, its just literally nix-build release.nix -A build.x86_64-linux - which should be reproducible enough :)
Regarding cached builds - do you really need a cache for nix builds? I mean it does not take that long to build locally. Otherwise one could just add cachix: https://github.com/nix-community/nur-packages-template/blob/master/.travis.yml#L31

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 31, 2019

@Mic92

This was the case for nixpkgs, but do you think they would timeout for nix as well?

If I add cross compilation to windows in the windows PR, that might time out.

Regarding local reproducibility, its just literally nix-build release.nix -A build.x86_64-linux - which should be reproducible enough :)

Yes that part is reproducable, but the wrapper YAML does other stuff and may fail or through away cache. That is hard to debug. Perhaps you've had better luck but I've seen the mainstream CIs waste a lot of developer time and am quiet sick of them overall.

Regarding cached builds - do you really need a cache for nix builds? I mean it does not take that long to build locally. Otherwise one could just add cachix: https://github.com/nix-community/nur-packages-template/blob/master/.travis.yml#L31

Yes, for the cross windows case, and other times we are hitting things that the underlying nixpkgs didn't cache. We could use cachix but I think the NixOS foundation should be consistent across projects; same CI and caching software. It's just simpler to maintain.


But to be clear all the things you say are still better to me than the status quo.

@7c6f434c
Copy link
Member

7c6f434c commented Nov 1, 2019

@Ericson2314 there are some specific profitable resource abuse options that do not require using (and therefore risking) any zero-days exploits (but can be abused with auto-builds)

@matthewbauer
Copy link
Member

matthewbauer commented Nov 1, 2019

This was the case for nixpkgs, but do you think they would timeout for nix as well?

If I add cross compilation to windows in the windows PR, that might time out.

Maybe we need a way in Nix to disallow uncached dependencies. Some PR to Nix shouldn't require building more than what's within Nix. Users should provide their own cached version if they need to build more. This limits the potential cost of CI without arbitrary rate limiting.

@zimbatm
Copy link
Member

zimbatm commented Nov 1, 2019

Well, if it is just Nix almost every CI would do. Hydra, Github Actions, Travis CI. A simple travis.yml would be shorter than this RFC.

The full Nix test suite depends on KVM to run integration tests, which is typically not supported by most build environments. Another requirement is to support both Linux and Darwin. And hopefully Windows in the future. If the idea is to keep Hydra always green it might be best just to go with Hydra, even if it's a separate setup.


There is a (famous blog post)[blog-post] that everyone is sloppy and doing CI wrong.
This isn't just bad for releasing software smoothly, but also increases the burden for new contributors because it is harder to judge the correctness of PRs at a glance (is it broken? Did I break it?).
I personally find it harder to contribute, I have to worry about double checking all my work on platforms I don't have as-easy access to, like Darwin.
Copy link
Member

@zimbatm zimbatm Nov 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another motivation is that broken tests are not always the author's fault as master often turns red. Meaning that the author now has to checkout master to see this is a new error or not. If master is red, the author now has to either:

a. find a working commit in the history. If there are any merge conflict this might not be an option.
b. try and fix master himself. This might be completely unrelated work, meaning the author has to load and understand a new context.
c. wait or pester other contributors to fix master so they can continue their work.

Basically the later an error gets caught, the more expensive it becomes to fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have something like "rocket science" CI if you treat "nixpkgs-channels" as the source code. The channels will only update if the branch passes all of its blocking tests. This accomplishes the same goal as mentioned in the post and avoids the crazy costs that come with "rocket science" CI.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the way we use Hydra for nixpkgs-channels is that if a broken commit is ever merged to nixpkgs, everything that gets merged after the broken commit also gets blocked from making it to nixpkgs-channels until the problem is fixed.

Don't get me wrong: It's definitely much better than some "CI" systems I've used, but it's not quite as good as the first one I used, which only blocks the broken commit until it's fixed, while allowing other (passing) commits to be merged.

@edolstra
Copy link
Member

edolstra commented Nov 1, 2019

Maybe we need a way in Nix to disallow uncached dependencies.

That wouldn't be really desirable, because it's pretty reasonable to change dependencies in a PR. E.g. our release.nix has

(aws-sdk-cpp.override {
  apis = ["s3" "transfer"];
  customMemoryManagement = false;
})

which you might want to change in a PR.

# Detailed design
[design]: #detailed-design

Set up Hydra declarative jobsets to build all Nix PRs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having used these in the past, they rarely work well. I think the Hydra model only really works for "post-merge" CI, not "pre-merge" CI. A tool like OfBorg (or something else) seems like a good solution to "pre-merge" CI.

@edolstra
Copy link
Member

edolstra commented Nov 7, 2019

Nominating myself for the RFC shepherd team.

@nh2
Copy link
Contributor

nh2 commented Nov 11, 2019

My opinion:

  • Yes, nix PRs should have automatic CI.
  • There are 2 safe ways to CI non-trusted-contributor PRs:
    • Run them on "free" infrastructure that tests in VMs with timeouts (like TravisCI, CircleCI etc).
    • Have some trusted contributer assert that there's nothing malicious in the PR (the ofborg model).

I think we should get the above working for nix before considering any form of automatic merging.

This is so that we get something done fast, and because clicking merge and reviewing small changes for maliciousness is easy, but and building and testing is what takes time. If we can automate just the latter for the beginning, we already win a lot.

@Mic92 Mic92 added status: in discussion and removed status: open for nominations Open for shepherding team nominations labels Dec 13, 2019
@Mic92
Copy link
Member

Mic92 commented Dec 13, 2019

We just need to decide on a leader.

@zimbatm
Copy link
Member

zimbatm commented Dec 16, 2019

leader=rand(4) :)

I think it's pretty obvious that this is a good thing to have. Even if it doesn't build the nested VMs tests, any CI is better than the current situation. I already did 3 different implementations with Travis, Circle CI and Azure Pipeline. Might as well to a 4th one with Hercules CI if we have the green light.

@domenkozar
Copy link
Member

@matthewbauer would you mind to be the leader?

@matthewbauer
Copy link
Member

@matthewbauer would you mind to be the leader?

Yeah I can accept!

@matthewbauer
Copy link
Member

We should try to schedule a short meeting to get everyone on the same page.

@matthewbauer
Copy link
Member

matthewbauer commented Jan 9, 2020

RFC Notes for 2019/01/09

  • We can setup Ofborg to build old prs. Ideally this would build all of release.nix.
  • Graham agrees Ofborg can be used for CI, but Hydra may actually be a better fit. Ofborg does not have a way of caching.
  • One concern of Ofborg is that logging gets mixed together in a single derivation.
  • Eelco would like to minimize number of CI.
  • Hydra has a github pr plugin that could be utilized, but it may take extra work to setup.
  • Ofborg sounds like a good option, but we have a few concerns. Eelco would like an instance just for Nix, so that PRs from Nix don’t have to wait on Nixpkgs PRs.
  • Ofborg merges commit to master before testing, which is good. This doesn’t meet the rocket science requirement but ofborg does let you retrigger each time and that can be done when there are few things to do.
  • Eelco and Graham talked about a potential issue in building the source tarball in release.nix with OfBorg. Some solutions are available though, namely using different platforms to build each tarball.
  • Domen discussed using Hercules which solves the above problem, but Hercules being closed source is probably a blocker.
  • Ofborg was decided as the CI to use. We all agree this is a step in the right direction. Graham expects it to be immediately usable for this purpose. Graham will be working on this tomorrow.

@edolstra
Copy link
Member

edolstra commented Jan 9, 2020

Thanks for the notes!

Eelco would like an instance just for Nix, so that PRs from Nix don’t have to wait on Nixpkgs PRs.

I don't want a separate instance. It might be desirable though if wait times are high due to Nixpkgs PRs. However according to Graham this probably won't be a problem.

@Ericson2314
Copy link
Member Author

I'm sorry i missed the email thread on this completely. Glad to here a choice was made though, and @grahamc is unblocked. I am happy to revise this to reflect the consensus, or whatever else you all would like to see this document turn into. (Do we need it at all? Maybe a general CI policy is more worth codifying than a one-off-change for just Nix.)

@edolstra
Copy link
Member

edolstra commented Jan 9, 2020

Yeah, I don't think we really need to codify this just for Nix. We're all in agreement that we want to CI all Nix PRs.

@Ericson2314
Copy link
Member Author

Ericson2314 commented Jan 9, 2020

@edolstra Agreed, there is no other decision a Nix-only RFC would affect, so it's not worth writing down. What do you think for other NixOS repos like cabal2nix and what not, should we ofborg all of them? That would be worth writing up.

@edolstra
Copy link
Member

edolstra commented Mar 5, 2020

Okay, I'll close this PR. Yeah, CI for other repos would certainly be good.

@gilligan
Copy link

gilligan commented Mar 6, 2020

Hey, can I ask what the status on this is?

I see it got closed since everyone is agreeing but.. It would be really nice to go for some quick-win here that works and can be used now-ish and then iterate on that.

I just wanted to try to hack something on Nix and realized several tests are failing on master. So it would be really nice to have something up and running even when it isn't the perfect solution. After all "perfect is the enemy of good" ;)

@edolstra
Copy link
Member

edolstra commented Mar 6, 2020

We agreed to enable ofborg on the Nix repo to test PRs. I think this requires some changes to ofborg though (cc @grahamc).

@Ericson2314
Copy link
Member Author

Yes it would be nice to have issue(s) somewhere tracking the progress of ofborg-ing various repos.

@roberth
Copy link
Member

roberth commented Mar 8, 2020

I'd be happy to help out with Hercules CI.
Hercules CI can build the nix repo by add a ci.nix file, which works like nix-build. Here's an example result.
If you do want to use existing builders, Hercules CI supports a workflow where a review is required before building. Building directly on separate hardware is also an option.
For caching, Hercules CI can use Cachix.
If there's interest beyond the nix repo, users with write access to such repos can enable/disable Hercules CI for those. Alternatively, organization admins can force Hercules CI to be restricted to certain repos only.

but Hercules being closed source is probably a blocker.

If this is about vendor lock-in, there isn't any. Hercules' ci.nix is nix-build-able, so the cost of switching is microscopic. If it's ideological, well, the agent is open source and you're free to do anything. I don't want to force Hercules CI on any project.

@grahamc did you make any progress?

If anyone else thinks it's worth at least trying, these are the steps for adding hercules:

  1. a NixOS GitHub admin authorizes the GitHub App. (non-admins can request but not authorize)
  2. if the choice is for a review-first workflow, also for bors
  3. deploy build agents

Should be easy, but I am happy to help with the deployment and updates of the build agents for the NixOS org, besides providing user support.

@edolstra
Copy link
Member

edolstra commented Mar 9, 2020

IIRC, we did discuss Hercules CI in the last RFC meeting, but decided to go with ofborg.

@domenkozar
Copy link
Member

domenkozar commented Mar 11, 2020

When we had the meeting 2 months ago, the assumption was that @grahamc can code this up in the folllowing weekend.

Is there a new estimate?

@domenkozar
Copy link
Member

domenkozar commented Mar 13, 2020

I created NixOS/nix#3409 that:

  • creates no additional vendor lock-in
  • adds value (the build currently fails)
  • is minimal to maintain since I already maintain github actions for Nix
  • addresses @gilligan concern that I also share

@domenkozar
Copy link
Member

Happy to report NixOS/nix#3409 has been merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.