[CI] Toward CI infrastructure as code #4958

hcho3 · 2019-10-17T20:20:22Z

#4401 and #4463 brought major improvements to continuous integration (*) by adding a whole new suite of platforms and targets (especially NVIDIA GPUs). We now have automated nightly builds for Windows and Linux, all capable of running GPU algorithms (gpu_hist) out of the box. This was made possible because we self-host Jenkins, a CI server that allows for greater customization, such as running GPU workers (**). However, self-hosting the CI has brought a few endemic problems:

Managing the CI infrastructure is a lot of manual work. There are many moving parts in the cloud platform (AWS) that need to be configured "just right" for the CI to work, and it is easy to break some configuration and make the CI inoperable. Examples: security groups, AMIs. Docker makes some parts of this task easier (package installation is automated), but the rest of the configuration is still manual. Even when problems are found and fixed by manual intervention, the diagnosis and fix are not documented, so the same problem may occur again in the future.
"The Bus Factor". Currently, the management of self-hosted CI infrastructure (https://xgboost-ci.net) is wholly dependent on me, because a large part of the infrastructure is manually configured. That means that there's no one else who can manage the CI if I am unavailable for some reason. I'd like to be able to delegate management to trusted contributors.
Instability in the provisioning of EC2 workers. Recently, quite a few tests timed out because Jenkins somehow could not allocate GPU agents in time, e.g. PR 4876, PR 4945. The cause is likely a problem with the EC2 plugin, since I do not see any error reported in CloudTrail. Update: I haven't seen this problem for a while. The EC2 plugin mostly works and launches new workers in response to incoming test jobs.
Lack of financial control. Currently, there is no mechanism to control spending, and I have to manually monitor the AWS billing page to enforce the budget (***). What if someone submits a pull request containing bitcoin mining code and keeps force-pushing 12 times in a row? Update: fixed in [CI] Enforce daily budget in Jenkins CI #5884.
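To make the budget point concrete, here is a minimal sketch of a daily spending guard. It is not how #5884 implements the check; `CostEntry`, `spent_on`, and `may_start_job` are hypothetical names, and the cost records are made-up numbers rather than real AWS billing data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CostEntry:
    """One line of spending. Hypothetical shape, not an AWS billing record."""
    day: date
    usd: float


def spent_on(entries, day):
    """Total spending recorded for the given day."""
    return sum(e.usd for e in entries if e.day == day)


def may_start_job(entries, today, daily_budget_usd):
    """Allow a new CI job only while today's spending is under the budget."""
    return spent_on(entries, today) < daily_budget_usd


# Made-up example data for illustration.
today = date(2020, 7, 16)
entries = [
    CostEntry(date(2020, 7, 15), 80.0),  # yesterday, ignored by the check
    CostEntry(today, 30.0),
    CostEntry(today, 25.0),
]

print(may_start_job(entries, today, daily_budget_usd=100.0))  # True  (55 < 100)
print(may_start_job(entries, today, daily_budget_usd=50.0))   # False (55 >= 50)
```

A guard like this would run before a job is dispatched, so a burst of force-pushes stops consuming workers once the day's budget is used up.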

All in all, these endemic issues waste developers' time and add yet another obstacle to open source contributions (as if contributors needed another problem!). To address them, I'd like to revamp the CI as follows:

Infrastructure as Code. All cloud configuration should be expressed as code. Then
- ~~CI configuration is now reproducible and fully documented.~~
- ~~Contributors can now make changes to the infrastructure (subject to approval from the PMC).~~
- ~~We can review every configuration change with a code review.~~
~~The CI for Apache MXNet is a good example: https://github.com/apache/incubator-mxnet-ci. It utilizes Terraform.~~ I'm now skeptical of the "Infrastructure as Code" idea. The concept proves to be useful in other situations and places, but not for the XGBoost project as it is now. See my comment below.
Custom auto-scaling logic. Instead of using the EC2 plugin in Jenkins, we should implement custom logic for provisioning EC2 workers. We can then debug any problem in EC2 provisioning and fix it. Additionally, we can transparently enforce a financial budget via commits and pull requests. Again, I refer to https://github.com/apache/incubator-mxnet-ci for an example. Update: I decided not to pursue this route. The EC2 plugin is good enough and has grown more stable over time. I found out that a custom auto-scaler is not necessary for financial control. See [CI] Enforce daily budget in Jenkins CI #5884. Reading the autoscaler code from https://github.com/apache/incubator-mxnet-ci convinced me of the sheer complexity of provisioning EC2 workers. This is also why the EC2 plugin in Jenkins is as large as it is.
Automated management of machine images. Currently, I manually configure all Windows workers, because I cannot use Docker (****). So I spent hours installing different versions of Visual Studio and CUDA. Since the process is so tedious, I don't get to perform system updates on the Windows workers, so they become outdated over time and vulnerable to security holes. I'd like to automate the creation of Windows workers. See https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows.
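For the auto-scaling item above, the core decision can be sketched in a few lines. This is an illustration only, not the logic of the Jenkins EC2 plugin or of the MXNet autoscaler; `workers_to_launch` and its parameters are hypothetical. It covers just the "how many workers to request" step and leaves out launching, health checks, and retries, which is where the real complexity of provisioning lies.

```python
def workers_to_launch(queued_jobs, idle_workers, running_workers, max_workers):
    """Return how many new workers to request.

    One worker is requested per queued job that no idle worker can pick up,
    but the total fleet size never exceeds max_workers.
    running_workers counts every worker currently up, idle ones included.
    """
    if min(queued_jobs, idle_workers, running_workers, max_workers) < 0:
        raise ValueError("counts must be non-negative")
    uncovered = max(0, queued_jobs - idle_workers)    # jobs with no worker yet
    headroom = max(0, max_workers - running_workers)  # room left under the cap
    return min(uncovered, headroom)


print(workers_to_launch(queued_jobs=5, idle_workers=1, running_workers=3, max_workers=8))  # 4
print(workers_to_launch(queued_jobs=5, idle_workers=0, running_workers=7, max_workers=8))  # 1
print(workers_to_launch(queued_jobs=2, idle_workers=3, running_workers=3, max_workers=8))  # 0
```

The `max_workers` cap is one place where a spending limit could be enforced through a reviewed code change.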

(*) Here, continuous integration refers to the practice of using an automated test farm in order to validate every code change (in our case, pull requests).
(**) If GPU support were not a requirement, we could just use a managed CI service such as Azure Pipelines. Unfortunately, to my knowledge, there is no managed CI solution that offers GPU workers.
(***) I have yet to compile a regular budget/spending report. Mea culpa.
(****) NVIDIA docker is unavailable for Windows: NVIDIA/nvidia-docker#429

@dmlc/xgboost-committer

hcho3 · 2019-11-27T00:24:59Z

Update: one of the contributors to MXNet CI, @larroy, strongly advised against the use of Terraform. He recommends cloudtools/troposphere instead.
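For readers unfamiliar with the idea, here is a minimal sketch of what "configuration expressed as code" can look like. It uses only the Python standard library to emit a CloudFormation-style JSON template for a security group; troposphere generates the same template format from Python classes. The resource name `CIWorkerSecurityGroup`, the port, and the CIDR range are assumptions for illustration, not the actual XGBoost CI configuration.

```python
import json


def security_group(description, ingress_rules):
    """Build a CloudFormation security group resource.

    ingress_rules is a list of (tcp_port, cidr) pairs.
    """
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": description,
            "SecurityGroupIngress": [
                {"IpProtocol": "tcp", "FromPort": port, "ToPort": port, "CidrIp": cidr}
                for port, cidr in ingress_rules
            ],
        },
    }


def build_template():
    """Assemble the full template. Every value here is hypothetical."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical CI security group (illustration only)",
        "Resources": {
            "CIWorkerSecurityGroup": security_group(
                "Allow SSH to CI workers",
                [(22, "10.0.0.0/16")],
            ),
        },
    }


template_json = json.dumps(build_template(), indent=2, sort_keys=True)
print(template_json)
```

Because the template is generated from a file in the repository, a change to a firewall rule becomes a diff that can go through code review.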

hcho3 · 2020-07-16T08:43:38Z

I made some updates.

Recently, I've grown skeptical about the whole "infrastructure as code" idea. Tools like Terraform, Troposphere, or AWS CloudFormation may automate the setup of CI, but they also introduce a specialized abstraction layer. The XGBoost project doesn't have anyone who specializes in devops, and all the current contributors (including me) would rather write code and fix bugs than learn a whole new devops language and toolset. (*) This situation is in contrast with other projects like MXNet and RAPIDS, where a team (!) of expert sysadmins takes care of ops.

I went ahead and documented the detailed steps it takes to set up a functioning Jenkins CI server in hcho3/xgboost-devops. One of the touted benefits of "infrastructure as code" is documentation, and the documents in xgboost-devops will serve this goal somewhat.

(*) I accidentally became "the devops person" of the XGBoost project, but no, I am not a sysadmin or SDET by training. All the knowledge I have about CI and Jenkins I obtained through googling and lots of trial and error. I have come to develop a huge appreciation for sysadmins and SDETs. They solve hard problems and tackle unexpected system errors every single day.

hcho3 · 2020-07-16T08:50:08Z

I think we should still try to automate the generation of machine images. Time to pick up some PowerShell. Also, AWS provides EC2 Image Builder, which automates the generation of custom AMIs.

larroy · 2020-07-17T20:26:28Z

Why don't you use github actions?

hcho3 · 2020-07-17T20:35:51Z

We are in the process of setting it up.

hcho3 · 2020-07-18T10:33:49Z

Now we have an automated script to set up workers: https://github.com/hcho3/xgboost-devops/blob/d4c8d42061254202ca9b9c4ec0be0c9db0624b40/worker_image_builder/windows_worker_setup.ps1

hcho3 · 2022-09-14T20:32:06Z

Completed in #8142

hcho3 mentioned this issue Nov 23, 2019

[CI] Jenkins is down #5061

Closed

trivialfis added the type: roadmap label Dec 3, 2019

StrikerRUS mentioned this issue May 29, 2020

[ci] make MiKTeX downloads more reliable microsoft/LightGBM#3124

Merged

hcho3 closed this as completed Jul 18, 2020

StrikerRUS mentioned this issue Oct 1, 2020

setup CUDA CI job microsoft/LightGBM#3424

Merged
