[ci] remove Travis (fixes #3519) #3672
Conversation
@guolinke could you add Docker? If we want to continue running Linux CI jobs in this container (https://github.com/guolinke/lightgbm-ci-docker), we have to be able to run a docker daemon on the VMs in that pool. |
@jameslamb I'd like to preserve the strategy we used with Travis for duplicated Python tests: oldest possible OS + default compiler for Azure, and newest available OS + non-default compiler for Travis. WDYT about this? Can we transfer "newest available OS + non-default compiler for Travis" to the new Azure pools? |
One hard thing here is that I think (based on https://docs.microsoft.com/en-us/azure/devops/pipelines/yaml-schema?view=azure-devops&tabs=schema%2Cparameter-schema#pool) that the VM image used for these new user-managed pools will be frozen, and can't be something dynamic.
For Linux, it's fairly simple to switch the OS using Docker, so I think it could work like this (see the sketch below for the Linux part):
For Mac, we don't have a group of self-hosted runners (#3519 (comment)), so we'll have to make it work with Microsoft-hosted ones.
Using this strategy, nothing would have to be manually changed in the self-hosted runners. I think this can work, especially remembering that there are fewer Mac jobs than Linux jobs. |
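As a rough sketch of the Linux side of that idea, a container job on a self-hosted pool could look something like the YAML below. This is only an illustration: the pool name `sh-ubuntu` is the one mentioned later in this thread, and the job name and `.ci` scripts are placeholders, not the actual configuration from this PR.

```yaml
# Hypothetical container job: the self-hosted pool only needs Docker;
# the OS actually under test comes from the container image.
jobs:
  - job: Linux_latest
    pool: sh-ubuntu            # assumed self-hosted agent pool with a Docker daemon
    container: ubuntu:latest   # swap this image to test a different OS version
    steps:
      - script: ./.ci/setup.sh # hypothetical setup script
        displayName: Setup
      - script: ./.ci/test.sh  # hypothetical test script
        displayName: Test
```

With that layout, changing the OS under test only means changing the `container:` image; the pool's frozen VM image never has to be touched.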
@jameslamb It uses dynamically created VMs, so we cannot pre-install Docker. |
It seems the only solution is to use cloud-init when creating the VM/VMSS.
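For reference, a minimal cloud-init sketch is shown below. The package choice and the `AzDevOps` agent user name are assumptions for illustration, not the actual configuration used for these agents.

```yaml
#cloud-config
# Hypothetical cloud-init for the VMSS instances: install Docker at first boot
# so dynamically created agents can run container jobs.
package_update: true
packages:
  - docker.io
runcmd:
  - systemctl enable --now docker
  - usermod -aG docker AzDevOps   # assumed agent user name; adjust as needed
```

A file like this can be passed when the scale set is created (for example via the Azure CLI's `--custom-data` option), so every new instance comes up with Docker already installed.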
Oh ok, I'm surprised that it isn't possible to create a custom image with any software you want, like you can do with AWS AMIs (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami). Based on your comments above, I can try customizing the init. Thanks! |
Sounds great!
@jameslamb @guolinke |
I'll try this first, let's see how much longer the CI is. I agree with your earlier comments that at our pace of development, extra CI time isn't a huge problem. |
The sh-ubuntu pool seems to work now. You can have a try. |
@guolinke can you give me permissions in Azure DevOps to cancel / re-try builds for the |
Ok @StrikerRUS can you take a look? I'd like to hear your suggestions. I currently have the following setup on Azure (Windows excluded because it's unchanged by this PR).
Most jobs seem to be passing, but I think this setup will be very slow. |
I don't think the capacity / limits problems are specific to this PR's changes. I started https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8312&view=results (for #3688) 15 minutes ago and most builds have not started yet |
Hi @jameslamb Can you give me a Microsoft account, so that I can add you? The self-hosted agents are created dynamically, so they may need more time to initialize. |
I like the list you've provided. 👍
I still think that coverage is more important than testing time.
If I'm not mistaken, we are limited to 10 free parallel jobs. Maybe another PR was being built at the same time as yours. Could you please remove all occurrences of the |
Seems we need OpenMP |
Yep, you're right! Didn't catch it before because I was using
One last test is failing, and I'm unsure what to do. This is extra weird because I didn't touch that job in this PR, and it's been passing on other PRs and on
@StrikerRUS have you seen this before or have any ideas what I can try? |
@jameslamb So weird! I expected the new GPU job with Clang to fail (refer to #3475), but not the old one with gcc! But the symptoms look very similar to the linked issue... |
Yeah, I'm pretty confused. The thing that's weirding me out the most is that this isn't even one of the jobs that is being moved over from Travis. It shouldn't be affected by this PR at all. I'll double-check my changes in
I do see in the logs that the failing job is using gcc as expected 😕 |
Does running tests on a self-hosted pool of agents mean that we actually use the same machine each time? I'm afraid that the environment changes done to run Docker with the latest Ubuntu caused some conflicts. |
😫 I think that's possible. Maybe when you use containers, Azure DevOps schedules multiple jobs into the same VM, assuming they're isolated? I just looked in the setup logs, and I see several volumes being mounted in, which is one way that information could leak between jobs. See the "initialize containers" step in https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8510&view=logs&j=b463fb3a-b487-5cfc-fc07-6d216464ba86&t=c3cec64d-f953-425d-a2c4-95244904eddb.

```shell
/usr/bin/docker create \
--name ubuntu-latest_ubuntulatest_dd8aa1 \
--label b0e97f \
--network vsts_network_571a4ba0d03d4c4cbd4e55cf6ef595c3 \
--name ci-container \
-v /usr/bin/docker:/tmp/docker:ro \
-v "/var/run/docker.sock":"/var/run/docker.sock" \
-v "/agent/_work/1":"/__w/1" \
-v "/agent/_work/_temp":"/__w/_temp" \
-v "/agent/_work/_tasks":"/__w/_tasks" \
-v "/agent/_work/_tool":"/__t" \
-v "/agent/externals":"/__a/externals":ro \
-v "/agent/_work/.taskkey":"/__w/.taskkey" \
ubuntu:latest \
"/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);"
```

If something like CMake's cache was getting written to one of those directories, details from one job could sneak into another. |
Oh, Azure is so lacking in the ability to cancel jobs! 😢 I tried to skip tests and run our examples (line 172 in 78d31d9), and changed `regular` to `gpu` in this piece of code (lines 174 to 196 in 78d31d9).
Here is the output:
So weird! |
My first guess is that the runners in that pool are smaller (less available memory / CPU / bandwidth) than the Microsoft-hosted ones. I'm not sure how to check that though. |
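One low-effort way to compare the two pools (an ad-hoc diagnostic idea, not part of this PR) would be a throwaway step that prints the agent's resources into the job logs of both pipelines:

```yaml
# Hypothetical diagnostic step: print CPU, memory, and disk available on the agent
# so self-hosted and Microsoft-hosted pools can be compared from the job logs.
steps:
  - script: |
      nproc
      free -h
      df -h
      grep "model name" /proc/cpuinfo | head -1
    displayName: Show agent resources
```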
oh awesome! I just made this change in 3a3c845 |
Then it is strange that only one step related to Docker is suffering... Also, Microsoft-hosted agents don't look very powerful
Maybe again some downloading issues like #3682?.. |
LGTM! Excellent PR!
By comparing logs before and after migration to self-hosted agents, I can see only two possible causes of 2x longer runs of |
I'm surprised by this, because the Azure docs say that Docker layer caching isn't available for Microsoft-hosted agents: https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/containers/build-image?view=azure-devops#is-reutilizing-layer-caching-during-builds-possible-on-azure-pipelines. And I expected that having a dedicated pool of agents would probably mean we'd get layer caching for free. I'm really not sure :/ |
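If layer caching does turn out to matter, a simple way to see whether it is being reused on the self-hosted agents (the image name and tag below are assumptions for illustration) is to pull the CI image in a dedicated step and compare that step's timing across consecutive runs on the same agent:

```yaml
# Hypothetical sketch: pre-pull the CI image in its own step. On a self-hosted
# agent the pulled layers stay on the machine's disk, so later runs on the same
# machine can reuse them; Microsoft-hosted agents start from an empty cache.
steps:
  - script: docker pull guolinke/lightgbm-ci-docker:latest   # image/tag assumed
    displayName: Pull CI image
```

If the second run of that step is nearly instant, the local layer cache is being reused and the slowness is coming from somewhere else.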
Thanks for the reviews @StrikerRUS and for setting up these self-hosted runners @guolinke. We have more things to tinker with in LightGBM's CI, but overall I think this should improve the stability. |
Just for the record.
Almost the same issue here. But the OP there didn't change anything in the config. Maybe different geo regions?.. Let's hope this slowness will be resolved without our help. |
Thanks, I hope it's just something like that! Very frustrating that support just closed that ticket as "Closed - not a bug" 😭 |
oh I see, ok |
Related to slow Docker pulls: https://david.gardiner.net.au/2020/05/docker-perf-on-azure-pipelines.html. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This is a draft PR to move CI jobs from Travis to Azure DevOps. This PR moves the remaining Mac + Linux jobs that are currently running on Travis to GitHub Actions. This project is ending its reliance on Travis based on Travis's strategic decision to offer only very limited support for open source projects. See #3519 for full background and discussion.