
[Test] Improve Katib CI/CD GitHub Actions #2024

Closed
5 of 7 tasks
andreyvelich opened this issue Nov 18, 2022 · 5 comments

Comments

andreyvelich (Member) commented Nov 18, 2022:

/kind feature
/area testing

Recently, we switched to GitHub Actions for our CI/CD pipelines. Thanks a lot again to @tenzen-y for driving this.

Since we now have limitations (20 concurrent jobs) and we haven't set up AWS EC2 instances for our workers yet, we need to make some improvements to reduce execution time.

I think we can try the following:

  1. Should we run the postgres test only for the Random search experiment? We run 3 Trials for the Random experiment, so that is enough to verify that the DB works properly.

  2. Can we build only the required suggestion images for each e2e test? As I can see, the build step takes around 15 min, which is more than half of the e2e time.

  3. @tenzen-y Are there any specific reasons why we clean the cache for our build images after each e2e run?

  4. Do we need to build images for linux/amd64 in the pre-commit check if that is verified as part of e2e?

  5. In the long term (or a separate tracking issue) we can also do the following:

    • Run only the required Experiments when the corresponding source code has changed (as we've done with the Katib UI).
    • Run all Experiment tests periodically, e.g. once a day. For pull request tests we can use only a few e2e experiments.
    • Use the Katib SDK instead of this script to run e2e, similar to the Training Operator, so we can also verify that our SDK is working.

@kubeflow/wg-training-leads @tenzen-y @anencore94 Are there any other improvements that you have in mind?

GitHub Actions improvements checklist

I can identify the following improvements:

  • Run postgres e2e only for random search.
  • Use the Katib SDK to create the E2E script.
  • Cancel in-progress workflow runs when a new commit is pushed, using the concurrency/cancel-in-progress setting.
  • Use the Docker cache when building our images.
  • Remove the linux/amd64 build from the pre-commit check since we verify this in the E2E tests.
  • Identify which experiments to run in E2E based on the changed code paths (see the trigger sketch below).
  • Run all E2E tests only on pre-releases or on a periodic schedule.

Please let me know if we should add more items @johnugeorge @anencore94 @terrytangyuan @tenzen-y @gaocegege
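
For the checklist items about code-path-based triggering, a minimal sketch of a path-filtered trigger (the paths below are illustrative, not the actual Katib repository layout):

```yaml
# Illustrative only: run this workflow's e2e jobs only when the listed paths
# change, similar to what is already done for the Katib UI tests.
on:
  pull_request:
    paths:
      - cmd/suggestion/**
      - pkg/suggestion/**
      - test/e2e/**
```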



terrytangyuan (Member) commented Nov 18, 2022:

One thing that might help is to avoid concurrent builds on the same PR (in case people push multiple commits that trigger separate builds): https://github.com/argoproj/argo-workflows/blob/master/.github/workflows/ci-build.yaml#L12-L14
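
A minimal sketch of the cancel-in-progress pattern referenced above (the concurrency group key here is a common choice, not necessarily the one used in the linked workflow):

```yaml
# Cancel a still-running workflow for the same ref when a new commit is pushed.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```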

We should utilize the cache on GitHub Actions.

If the image builds are time-consuming, we should consider pre-building the image cache that Docker can use.
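
One possible shape for that, sketched with docker/build-push-action and its GitHub Actions cache backend; the Dockerfile path and image tag are placeholders, not Katib's actual build configuration:

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Build image with layer caching
  uses: docker/build-push-action@v3
  with:
    context: .
    file: ./cmd/katib-controller/v1beta1/Dockerfile  # placeholder path
    tags: katib-controller:e2e                       # placeholder tag
    push: false
    cache-from: type=gha        # restore layers from the GitHub Actions cache
    cache-to: type=gha,mode=max # save all layers back to the cache
```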

anencore94 (Member) commented:

> 1. Should we run the postgres test only for the Random search experiment? We run 3 Trials for the Random experiment, so that is enough to verify that the DB works properly.

I agree with you. Changing the postgres test to run only for one general experiment makes sense.

> 5. In the long term (or a separate tracking issue) we can also do the following:

Also, I think we could separate the e2e tests into two stages: on pull_request and on pre-release. It is reasonable to narrow the e2e tests for the pull_request trigger; however, I think it is safer to test every possible case at least once before a release, since some combinations can produce unexpected results.
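
A sketch of how the two stages could be triggered, assuming one workflow file for the reduced PR suite and a separate one for the full suite (the cron value is illustrative):

```yaml
# Reduced e2e suite: runs on every pull request.
on:
  pull_request: {}
---
# Full e2e suite, in a separate workflow file: runs daily and on (pre-)releases.
on:
  schedule:
    - cron: "0 6 * * *"
  release:
    types: [prereleased, released]
```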

tenzen-y (Member) commented:

@andreyvelich Thanks for creating this issue.

> Can we build only the required suggestion images for each e2e test? As I can see, the build step takes around 15 min, which is more than half of the e2e time.

Makes sense. When I migrated the e2e tests to GitHub Actions, I made all e2e tests build all suggestion images to avoid complicated shell scripts. But as you say, we can avoid the complex scripts by rebuilding the e2e tests around the Katib Python client.
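
A rough sketch of how a job matrix could build only the suggestion image a given e2e run needs; the algorithm names, Dockerfile paths, and image tags are assumptions for illustration:

```yaml
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - algorithm: random
            suggestion-dir: cmd/suggestion/hyperopt/v1beta1  # assumed path
          - algorithm: tpe
            suggestion-dir: cmd/suggestion/hyperopt/v1beta1  # assumed path
    steps:
      - uses: actions/checkout@v3
      - name: Build only the suggestion image this run needs
        run: |
          docker build -t katib-suggestion-${{ matrix.algorithm }}:e2e \
            -f ${{ matrix.suggestion-dir }}/Dockerfile .
```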

> @tenzen-y Are there any specific reasons why we clean the cache for our build images after each e2e run?

I added the step to avoid the error `write /var/lib/docker/tmp/GetImageBlob424493410: no space left on device`.
But we might be able to remove the cache-cleaning step if we make each e2e test build only the required suggestion images, as mentioned above.
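
For context, a minimal sketch of a disk-cleanup step of the kind described here (not necessarily the exact step Katib uses):

```yaml
- name: Free runner disk space
  run: |
    docker system prune --all --force  # remove unused images and build cache
    df -h                              # log the remaining disk space
```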

> Do we need to build images for linux/amd64 in the pre-commit check if that is verified as part of e2e?

I added the linux/amd64 platform to verify that we can build multi-platform images. In e2e, we only build single-platform images.
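
For reference, a sketch of a build-only multi-platform check of the kind described; the Dockerfile path and platform list are placeholders:

```yaml
- uses: docker/setup-qemu-action@v2
- uses: docker/setup-buildx-action@v2
- name: Verify the multi-platform image build (no push)
  uses: docker/build-push-action@v3
  with:
    context: .
    file: ./cmd/katib-controller/v1beta1/Dockerfile  # placeholder path
    platforms: linux/amd64,linux/arm64               # assumed platform list
    push: false
```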

github-actions (bot) commented:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich (Member, Author) commented:

Since @tenzen-y made our E2E actions very stable, we can close this issue. Thanks again for this effort!
Let's track additional improvements separately.
