[CI] End-of-year CI budget report: cost exceeded projection #5176

Closed
hcho3 opened this issue Dec 31, 2019 · 8 comments · Fixed by #5884

Comments

@hcho3
Collaborator

hcho3 commented Dec 31, 2019

@dmlc/xgboost-committer

In March 2019, AWS graciously granted us 12k USD worth of AWS credits to maintain our CI server (https://xgboost-ci.net). In addition, sponsors including NVIDIA have committed to monthly donations through Open Source Collective. Since then, we have revamped our CI (#4234) and added test coverage for more platforms and targets (CUDA, multi-GPU, Dask, Spark).

Now here is the end-of-year report. I had estimated the cloud cost at 1,000 USD/month, but thanks to active contributions over the year, the average cost has been higher, at about 1,600 USD/month. As a result, the 12k AWS credit ran out 3 months earlier than I expected. See the following table:

| Month | Cost covered by AWS credit | Cost covered by donations | Total spend |
|---|---:|---:|---:|
| Mar-19 | $709.58 | - | $709.58 |
| Apr-19 | $1,269.62 | - | $1,269.62 |
| May-19 | $1,985.30 | - | $1,985.30 |
| Jun-19 | $1,608.34 | - | $1,608.34 |
| Jul-19 | $1,522.61 | - | $1,522.61 |
| Aug-19 | $290.43 | $1,383.86 | $1,674.29 |
| Sep-19 | $1,038.17 | - | $1,038.17 |
| Oct-19 | $2,204.07 | - | $2,204.07 |
| Nov-19 | $1,304.33 | - | $1,304.33 |
| Dec-19 | $453.44 | $2,214.57 (*) | $2,668.01 |
| TOTAL | $12,385.89 | $3,598.43 | $15,984.32 |

| Account | Today's balance (Dec 31) |
|---|---:|
| AWS Credit | - |
| Open Source Collective | $1,291.79 |

(*) There isn't actually enough balance in the donation account to cover this amount. For now, I will personally cover the difference.
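
As a quick check on the figures above, the following sketch recomputes the total and the average monthly spend from the "Total spend" column and compares the average with the original 1,000 USD/month estimate. The variable names are illustrative only; the numbers are copied from the table.

```python
# Monthly totals (USD) from the "Total spend" column, Mar-19 through Dec-19.
monthly_spend = [
    709.58, 1269.62, 1985.30, 1608.34, 1522.61,
    1674.29, 1038.17, 2204.07, 1304.33, 2668.01,
]
projected_monthly = 1000.0  # the original estimate quoted in the report

total_spend = sum(monthly_spend)
average_spend = total_spend / len(monthly_spend)
# Percentage by which the observed average exceeds the projection.
overrun_pct = (average_spend / projected_monthly - 1.0) * 100.0

print(f"total:   ${total_spend:,.2f}")
print(f"average: ${average_spend:,.2f} per month")
print(f"overrun: {overrun_pct:.1f}% above projection")
```

The average comes out just under 1,600 USD/month, consistent with the rounded figure quoted in the report.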

I am reaching out to AWS for another round of donations. Absent additional funding, we will need to take drastic cost-saving measures. Let us find ways to keep the server running. I personally donated USD 3042 to keep the server going for another month, so that we can push out the 1.0 release.

Cost by EC2 instance type:

| Month | P2 | G4 | C5 | Misc | Total cost ($) |
|---|---:|---:|---:|---:|---:|
| Mar-19 | $651.80 | $0.00 | $2.71 | $55.07 | $709.58 |
| Apr-19 | $1,133.97 | $0.00 | $59.19 | $76.47 | $1,269.63 |
| May-19 | $1,159.73 | $0.00 | $745.00 | $80.57 | $1,985.30 |
| Jun-19 | $942.04 | $0.00 | $579.10 | $87.19 | $1,608.34 |
| Jul-19 | $863.41 | $0.00 | $545.44 | $113.76 | $1,522.61 |
| Aug-19 | $934.75 | $0.00 | $616.85 | $122.70 | $1,674.30 |
| Sep-19 | $535.05 | $0.00 | $372.18 | $130.94 | $1,038.16 |
| Oct-19 | $821.44 | $459.08 | $778.56 | $144.99 | $2,204.07 |
| Nov-19 | $321.95 | $385.00 | $446.01 | $151.38 | $1,304.34 |
| Dec-19 | $553.92 | $536.96 | $1,369.21 (**) | $207.92 | $2,668.02 |
| TOTAL | $7,918.07 | $1,381.04 | $5,514.24 | $1,170.99 | $15,984.34 |

(**) I accidentally assigned the C5.9xlarge instance type to Windows workers, which increased the cost by 700 USD. The issue has been fixed by downgrading them to C5.4xlarge.

Cost by AWS service:

| Month | EC2 | ECR | S3 | Total |
|---|---:|---:|---:|---:|
| Mar-19 | $709.58 | $0.00 | $0.00 | $709.58 |
| Apr-19 | $1,265.43 | $4.11 | $0.09 | $1,269.63 |
| May-19 | $1,972.74 | $11.82 | $0.73 | $1,985.30 |
| Jun-19 | $1,581.12 | $23.92 | $3.30 | $1,608.34 |
| Jul-19 | $1,469.42 | $38.35 | $14.84 | $1,522.61 |
| Aug-19 | $1,611.53 | $54.02 | $8.75 | $1,674.30 |
| Sep-19 | $966.67 | $62.80 | $8.69 | $1,038.16 |
| Oct-19 | $2,127.13 | $69.87 | $7.07 | $2,204.07 |
| Nov-19 | $1,219.30 | $81.50 | $3.54 | $1,304.34 |
| Dec-19 | $2,550.89 | $99.94 | $17.18 | $2,668.02 |
| Total | $15,473.82 | $446.33 | $64.19 | $15,984.34 |

@CodingCat
Member

CodingCat commented Jan 1, 2020

There is a Jenkins GitHub plugin with which automated tests are triggered only when a committer comments "ok to test": https://wiki.jenkins.io/display/JENKINS/GitHub+pull+request+builder+plugin

@hcho3
Collaborator Author

hcho3 commented Jan 1, 2020

@CodingCat Nice! Thanks for the link. Let me take a look at it. This should let us save some CI runs caused by WIP commits.
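
To make the gating idea concrete, here is a minimal sketch of the decision rule only. It is not the plugin's implementation or API: the `Comment` type, the `should_run_ci` function, and the committer set are hypothetical names introduced for illustration, and the only detail taken from the thread is the "ok to test" phrase.

```python
from dataclasses import dataclass

# Trigger phrase quoted in the comment above.
TRIGGER_PHRASE = "ok to test"


@dataclass
class Comment:
    """Hypothetical stand-in for a pull request comment."""
    author: str
    body: str


def should_run_ci(comments, committers):
    """Return True once any committer has posted the trigger phrase."""
    return any(
        c.author in committers and TRIGGER_PHRASE in c.body.lower()
        for c in comments
    )


# Example: only the committer's approval starts a CI run.
committers = {"alice"}
thread = [Comment("bob", "ok to test"), Comment("alice", "looks fine, OK to test")]
print(should_run_ci(thread[:1], committers))  # non-committer comment only
print(should_run_ci(thread, committers))      # committer has approved
```

Under this rule, WIP commits on a pull request would not consume CI time until a committer explicitly asks for a run.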

@hcho3
Collaborator Author

hcho3 commented Jan 1, 2020

One idea is to add a script to throttle provision of EC2 workers to meet the monthly budget limit. If the limit is breached, no new EC2 worker would be launched.
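
A rough sketch of what such a throttle check could look like follows. Everything here is an assumption made for illustration: the function name, the idea of passing in month-to-date spend as a plain number, and the optional estimated cost of the next job. How the spend figure is obtained (for example, from AWS billing data) is left out.

```python
def can_launch_worker(month_to_date_usd, monthly_budget_usd, est_job_cost_usd=0.0):
    """Return True if launching one more worker stays within the monthly budget.

    month_to_date_usd:  spend so far this month (assumed to be supplied by the caller)
    monthly_budget_usd: the monthly limit
    est_job_cost_usd:   optional estimate of what the new worker will cost
    """
    return month_to_date_usd + est_job_cost_usd <= monthly_budget_usd


# Example with a hypothetical 1,000 USD monthly limit.
budget = 1000.0
for spend in (400.0, 980.0, 1200.0):
    decision = "launch" if can_launch_worker(spend, budget, est_job_cost_usd=30.0) else "hold"
    print(f"spend so far ${spend:,.2f}: {decision}")
```

Jobs that are held back would wait in the queue until the next billing month or until the limit is raised.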

@CodingCat
Member

> One idea is to add a script to throttle provision of EC2 workers to meet the monthly budget limit. If the limit is breached, no new EC2 worker would be launched.

I agree. Even internally, we do not provision CI workers without limit; instead, we tend to queue up tests (to prevent premature commits from taking too many resources).

@hcho3
Collaborator Author

hcho3 commented Jan 1, 2020

hcho3/xgboost-devops#2

@hcho3
Collaborator Author

hcho3 commented Jan 6, 2020

@CodingCat I installed the GitHub Pull Request Builder plugin. I will watch the repository closely over the next few days to see whether the plugin is working. EDIT: This seems to clash with the GitHub Branch Source plugin. Will investigate.

@thvasilo
Contributor

thvasilo commented Jan 9, 2020

For personal projects I've also used the [AWS Budgets](https://aws.amazon.com/aws-cost-management/aws-budgets/) feature to at least send email warnings, and I'm now looking into automatic shutdown of instances as well. I think the budget warnings, at least, should be implemented (it's easy), so you can intervene manually in case something goes wrong, like the wrong instance type eating up all the budget.
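
As an illustration of how small the setup can be, the sketch below builds the request for a monthly cost budget with one email alert. The helper name, budget name, account ID, email address, and 80% threshold are all placeholders, not values from this project. The dictionary is meant to match the keyword arguments of the `create_budget` call in boto3's `budgets` client; the call itself is left as a comment because it needs boto3 and AWS credentials.

```python
def build_budget_request(account_id, limit_usd, email, threshold_pct=80.0):
    """Build keyword arguments for a monthly cost budget with one email alert."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "ci-monthly-budget",  # placeholder name
            "BudgetLimit": {"Amount": f"{limit_usd:.2f}", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                # Alert when actual spend exceeds threshold_pct of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": email}
                ],
            }
        ],
    }


# Placeholder account ID and address, for illustration only.
request = build_budget_request("123456789012", 1000, "ci-admin@example.com")

# With boto3 installed and credentials configured, this could be submitted as:
#   import boto3
#   boto3.client("budgets").create_budget(**request)
print(request["Budget"]["BudgetLimit"])
```

An alert like this would not stop instances by itself; it only gives maintainers a chance to step in before the month's budget is gone.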

@tqchen
Member

tqchen commented Jan 9, 2020

I would also recommend that we look at alternatives. For example, GitHub Actions and Azure Pipelines are both good options for CPU CI (Dask, Spark, Windows), and we could use Jenkins only for GPU CI jobs, which can be triggered optionally.
