[CI] Toward CI infrastructure as code #4958
Comments
Update: one of the contributors to MXNet CI, @larroy, strongly advised against the use of Terraform. It's better to use cloudtools/troposphere.
I made some updates. Recently, I've grown skeptical about the whole "infrastructure as code" idea. Tools like Terraform, Troposphere, or AWS CloudFormation may automate the setup of CI, but they also introduce a specialized abstraction layer. The XGBoost project doesn't have anyone who specializes in devops, and all the current contributors (including me) would rather write code and fix bugs than learn a whole new devops language and toolset. (*) This is in contrast with other projects like MXNet and RAPIDS, where a team (!) of expert sysadmins takes care of ops.
I went ahead and documented the detailed steps it takes to set up a functioning Jenkins CI server in hcho3/xgboost-devops. One of the touted benefits of "infrastructure as code" is documentation, and the documents in xgboost-devops will serve this goal somewhat.
(*) I accidentally became "the devops person" of the XGBoost project, but no, I am not a sysadmin or SDET by training. All the knowledge I have about CI and Jenkins I obtained through googling and lots of trial and error. I have come to develop a huge appreciation for sysadmins and SDETs. They solve hard problems and tackle unexpected system errors every single day.
I think we should still try to automate the generation of machine images. Time to pick up some PowerShell. Also, AWS provides EC2 Image Builder, which automates the generation of custom AMIs.
Why don't you use GitHub Actions?
We are in the process of setting it up.
Now we have an automated script to set up workers: https://github.com/hcho3/xgboost-devops/blob/d4c8d42061254202ca9b9c4ec0be0c9db0624b40/worker_image_builder/windows_worker_setup.ps1
Completed in #8142
#4401 and #4463 brought major improvements to continuous integration (*) by adding a whole new suite of platforms and targets (especially NVIDIA GPUs). We now have automated nightly builds for Windows and Linux, all capable of running GPU algorithms (`gpu_hist`) out of the box. This was made possible because we self-host Jenkins, a CI server that allows for greater customization, such as running GPU workers (**). However, the use of self-hosted CI has brought a few endemic problems:
- Instability in the provisioning of EC2 workers. Recently, quite a few tests timed out because Jenkins somehow could not allocate GPU agents in time, e.g. PR 4876, PR 4945. The cause is likely a problem with the EC2 plugin, since I do not see any error reported in CloudTrail.
  Update: I haven't seen this problem for a while. The EC2 plugin mostly works and launches new workers in response to incoming test jobs.
- Lack of financial control. Currently, there is no mechanism to control spending, and I have to manually monitor the AWS billing page to enforce the budget (***). What if someone submits a pull request containing bitcoin mining code and keeps force-pushing 12 times in a row?
  Update: Fixed in [CI] Enforce daily budget in Jenkins CI #5884.
All in all, these endemic issues waste developers' time and add yet another obstacle to open source contributions (as if contributors needed another problem!). To address them, I'd like to revamp the CI as follows:
- Infrastructure as Code. All cloud configuration should be expressed as code. Then:
  - The CI configuration becomes reproducible and fully documented.
  - Contributors can make changes to the infrastructure (subject to approval from the PMC).
  - We can review every configuration change with a code review.
  The CI for Apache MXNet is a good example: https://github.com/apache/incubator-mxnet-ci. It utilizes Terraform.
  Update: I'm now skeptical of the "Infrastructure as Code" idea. The concept proves useful in other situations and places, but not for the XGBoost project as it is now. See my comment below.
- Custom auto-scaling logic. Instead of using the EC2 plugin in Jenkins, we should implement custom logic for provisioning EC2 workers. We could then debug and fix any problem in EC2 provisioning. Additionally, we could transparently enforce a financial budget via commits and pull requests. Again, I refer to https://github.com/apache/incubator-mxnet-ci for an example.
  Update: I decided not to pursue this route. The EC2 plugin is good enough and has grown more stable over time. I found out that a custom auto-scaler is not necessary for financial control. See [CI] Enforce daily budget in Jenkins CI #5884. Reading the autoscaler code from https://github.com/apache/incubator-mxnet-ci convinced me of the sheer complexity of provisioning EC2 workers. This is also why the EC2 plugin in Jenkins is as large as it is.
- Automated management of machine images. Currently, I manually configure all Windows workers, because I cannot use Docker (****). So I spent hours installing different versions of Visual Studio and CUDA. Since the process is so tedious, I don't get to perform system updates on the Windows workers, so they become outdated over time and vulnerable to security holes. I'd like to automate the creation of Windows workers. See https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows.
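To give a rough feel for the "custom auto-scaling logic" item above, here is a minimal sketch of only the sizing decision: how many extra workers to request given the queue and a hard cap. The function name and all four parameters are hypothetical and are not taken from the Jenkins EC2 plugin or from incubator-mxnet-ci. The sketch deliberately leaves out everything that makes real provisioning hard (launch failures, slow boots, terminating idle workers).

```python
def workers_to_launch(queued_jobs, idle_workers, total_workers, max_workers):
    """Return how many new workers to request (hypothetical sizing rule).

    Assumes one job per worker. The result is never negative and never
    pushes the fleet above max_workers.
    """
    # Jobs that no currently idle worker can pick up.
    shortfall = max(0, queued_jobs - idle_workers)
    # Room left under the fleet-size cap.
    headroom = max(0, max_workers - total_workers)
    return min(shortfall, headroom)
```

For example, with 5 queued jobs, 1 idle worker, 2 workers in total, and a cap of 4, this rule requests 2 workers: the shortfall is 4 but only 2 fit under the cap. A cap of this kind is also one crude way to bound spending, since it limits how many instances can run at once.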
(*) Here, continuous integration refers to the practice of using an automated test farm in order to validate every code change (in our case, pull requests).
(**) If GPU support were not a requirement, we could just use a managed CI service such as Azure Pipelines. Unfortunately, to my knowledge, there is no managed CI solution that offers GPU workers.
(***) I have yet to compile a regular budget/spending report. Mea culpa.
(****) NVIDIA docker is unavailable for Windows: NVIDIA/nvidia-docker#429
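The "lack of financial control" problem above comes down to a simple gate: refuse to start new work once the day's spend would exceed a limit. The sketch below illustrates that idea only. It is not the mechanism implemented in #5884, and `CostRecord`, `spent_on`, `may_start_job`, and the dollar figures are all hypothetical. In practice the records would have to come from some billing data source, which is not modeled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CostRecord:
    """One billing entry. The fields are hypothetical, not an AWS schema."""
    day: date
    usd: float


def spent_on(records, day):
    """Sum the spend recorded for a single calendar day."""
    return sum(r.usd for r in records if r.day == day)


def may_start_job(records, day, daily_budget_usd, estimated_job_usd=0.0):
    """Return True if one more job keeps the day's spend within the budget."""
    return spent_on(records, day) + estimated_job_usd <= daily_budget_usd


# Usage with made-up numbers: 75 USD already spent against a 100 USD budget.
records = [
    CostRecord(date(2020, 7, 1), 40.0),
    CostRecord(date(2020, 7, 1), 35.0),
    CostRecord(date(2020, 7, 2), 5.0),
]
allowed = may_start_job(records, date(2020, 7, 1), 100.0, estimated_job_usd=20.0)
```

A check like this would run before a job is dispatched, so a contributor who force-pushes repeatedly would eventually hit the daily limit instead of running up the bill unnoticed.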
@dmlc/xgboost-committer