CI Job Contention #418
Just want to mention that I'm working on this now; I'll have a PR and a writeup in a bit 👌
@CleanCut thanks for bringing this up, it's something I've kept an eye on but haven't made a discussion for 😄 I'd love to go over my thought process and hear about anything I might not be thinking about 👌

There are two different sides to optimize for. First, we could optimize for fewer jobs per commit: if all of the CI work ran in just 3 jobs, as you suggest, more commits could be tested at once. On the other hand, we could optimize for parallel jobs, which reduces the total time a single commit takes to pass CI. I've optimized for the latter. After trying out your suggested optimizations, I got some results, though the times might not hold up in the wild since this is just n=1.

A potential downside of combining the jobs is that it's harder to tell whether a test failed or a check failed. Originally, one of the reasons I decoupled them into different jobs was that I wasn't sure how badly their caches would clash before I looked deeper into it; it seems to be quite negligible.
Also, I don't believe commits to Bevy are as frequent and clustered as you assume. If we take a look at the Actions tab, we can see that for the most part commits are spaced out, so it's fairly unlikely that multiple commits are running CI at the same time. When that does happen, as long as each commit uses fewer than 10 jobs, they should be able to run concurrently. Extending this, I think it's quite unlikely that 3 commits are running at the same time. However, to adapt to this, we can optimize for 6 jobs running concurrently per commit, which would allow 3 commits to be tested at once. To do this, we use the 3 operating systems, each testing on nightly and stable; those are our 6 concurrent jobs (sketched below). We then leave the clean job as its own separate job.

So, tl;dr, a few steps I've taken instead of bundling the jobs entirely:
Side note: something is stopping macOS + nightly from building properly. I've disabled it for now, and I'm going to look into why it might be happening on the cargo side. I made #380 for this already, but it's quite inconsistent. Anyway, this turned into quite a post 😅 I'd love to hear what you think 💯
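To make the plan concrete, here is a minimal sketch of what that 6-job matrix could look like, assuming a plain rustup/cargo setup; the job name, step commands, and action version are illustrative rather than taken from Bevy's actual workflow.

```yaml
# Hypothetical sketch of the matrix described above: 3 OSes x {stable, nightly},
# with macOS + nightly excluded until the build issue is tracked down.
name: CI
on: [push, pull_request]

jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        toolchain: [stable, nightly]
        exclude:
          - os: macos-latest    # see the macOS + nightly side note above
            toolchain: nightly
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - run: rustup default ${{ matrix.toolchain }}
      - run: cargo test --workspace
```

The `exclude` entry is how the macOS + nightly combination can be dropped without touching the rest of the matrix.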
👍 I like the plan. Given that jobs fail fast (one job failing cancels all the others), I'm a little on the fence about whether it's worth putting the clean job first as a separate dependent job, rather than just picking one of the 6 jobs and running the clean checks as its first steps (both variants are sketched below). We could certainly give it a try and see how it works out in practice. In any case, I think this plan is a strict improvement over what we have now. 😄

There is one additional point to discuss: the limit of 5 concurrent macOS jobs means that with our 2 macOS jobs we will never be able to run 3 full CI runs concurrently. Is that worth optimizing for? I see four paths:
Which way we choose probably isn't really important, because it's easy to try another option if we're not happy with how things turn out. With that in mind, I recommend option 1 (ignoring the problem) and seeing if that's good enough. 😉 Thanks for taking this on! ❤️
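As a rough sketch of the two variants being weighed here (job names and commands are illustrative, not the actual workflow): either the clean checks gate everything as a separate job via `needs`, or those same `run` steps simply move to the top of one of the six matrix jobs.

```yaml
# Hypothetical sketch of variant (a): "clean" as a separate job the others depend on.
# Variant (b) would instead put the fmt/clippy steps at the top of one matrix job.
jobs:
  clean:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: rustup component add rustfmt clippy
      - run: cargo fmt --all -- --check
      - run: cargo clippy --workspace -- -D warnings
  build:
    needs: clean    # in the real workflow this would be the 6-job matrix; shown as one job for brevity
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: cargo test --workspace
```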
Yup, I like the plan. I also think that failing fast means we might not actually need to worry about dependent jobs. Adding bors is an important next step as well, so we can maintain the integrity of our checks; I think that's something I'll need to add, but if anyone has experience and could give me the rundown, that would be great. In regard to this issue, I think we're in agreement that we should be able to merge #423?
#423 implements "the plan" + option 2, but leaves in the `clean` job. A later PR could combine the `clean` steps into one of the other jobs.
There are two main advantages to using bors:
Most projects run Configuring
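One wiring detail that comes with bors-ng (general guidance, not something decided in this thread): bors builds its candidate merges on its own `staging` and `trying` branches, so the workflow's triggers need to include those branches for bors to see a status it can gate on. A sketch, using bors-ng's default branch names:

```yaml
# Hypothetical sketch: with bors-ng, CI also needs to run on the branches where
# bors pushes its candidate merges, so the workflow triggers would include them.
on:
  pull_request:
  push:
    branches:
      - master
      - staging   # bors r+ batches land here first
      - trying    # bors try builds land here
```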
Brilliant. Thanks!
Just to add on, #423 only removes macOS + nightly because there's an issue building it that I need to track down. This will require further examination into which of the 4 choices will work well. I still believe that option 1 has potential, so I'll keep an eye on the job frequency and see if I can pull together some stats.
I trust @AngelOnFira can take it from here. I moved the |
The way I phrased my information in #129 (comment) may have led to the conclusion that GitHub Actions are unlimited in every way for open source projects. It is true that the total number of minutes does not have a limit. It is not true that the number of concurrent jobs is unlimited.
There are two limits on the number of concurrent jobs for free, open source plans:
- a maximum of 20 concurrent jobs in total
- a maximum of 5 concurrent macOS jobs
We're hitting at least the first limit. Due to the proliferation of jobs in a CI run (#357, #373, etc.), combined with the high level of activity on this project, jobs end up waiting in the queue for a slot to open up.
For example, take a look at this run. If it had been able to launch all the jobs simultaneously, it would have completed in 45 seconds, because clippy failed quickly. But the clippy job didn't get a chance to start for ~6 minutes because of concurrent jobs from other pull requests.
Currently there are 9 jobs defined, which means we can only have 2 concurrent CI runs before we start piling up delays. I would suggest, at a minimum, collapsing 3 of the jobs (2 test, 1 clean) into 2 of the build jobs, for a total of 6 jobs. Tests already have to build the code, so there shouldn't be any reduction in what CI is checking. Adding the "clean" steps on top of one of those jobs may increase its runtime a little, but if we pick the job that completes fastest on average, we may avoid increasing the maximum job time at all. Even if we do add a few seconds, the overall increase in throughput is a clear win.
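A minimal sketch of what one such collapsed job could look like (the job name and commands are illustrative): the standalone build step disappears because `cargo test` already builds everything, and the cheap "clean" checks run first so the job still fails fast on formatting or clippy errors.

```yaml
# Hypothetical sketch of one combined job: clean checks + tests in a single job,
# replacing separate build, test, and clean jobs.
jobs:
  build-and-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: rustup component add rustfmt clippy
      - run: cargo fmt --all -- --check               # "clean" checks first; they fail quickly
      - run: cargo clippy --workspace -- -D warnings
      - run: cargo test --workspace                   # builds and tests in one step
```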
If no one else beats me to it, I'm willing to make a PR for this later this week. I won't complain if someone else would rather do it, though. 😄
/cc @AngelOnFira