[BEAM-10184] Build python wheels on GitHub Actions for Linux/MacOS #11877
Conversation
github-actions are currently running on the PR in my fork: TobKed#3
Force-pushed from 391627b to 557209a
How can I preview the action on the fork: TobKed#3 ?
```yaml
- name: Upload compressed sources
  uses: actions/upload-artifact@v2
  with:
    name: source_gztar_zip
```
What does this do? Is it re-compressing the source folder? I wonder if we can use the sdist output for it.
(Ideally the resulting GCS output looks close enough to a release output, e.g. https://dist.apache.org/repos/dist/release/beam/2.21.0/python/ )
It is uploading the sdist output (`zip`, `tar.gz`) files as artifacts, where they are picked up by the `upload_source_to_gcs` jobs. An additional advantage of the artifacts is that they can be downloaded from the GitHub Actions workflow web view as well.
Example of GCS output for the `build_source` -> `upload_source_to_gcs` jobs:
gs://**/apache-beam-2.23.0.dev0.tar.gz
gs://**/apache-beam-2.23.0.dev0.zip
Here you can check the GitHub Action which ran on my fork.
Wheel files (`*.whl`) are processed accordingly by the `build_wheels` -> `upload_wheels_to_gcs` jobs.
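The artifact handoff between jobs can be sketched roughly like this (a hedged sketch: the artifact name matches the snippet quoted earlier, but the paths, the gsutil destination, and the authentication setup are placeholders, not the actual workflow contents):

```yaml
# In the build_source job: publish the sdist output as a workflow artifact.
- name: Upload compressed sources
  uses: actions/upload-artifact@v2
  with:
    name: source_gztar_zip
    path: apache-beam-source/        # placeholder path

# In the upload_source_to_gcs job: fetch the artifact and copy it to GCS.
- name: Download compressed sources
  uses: actions/download-artifact@v2
  with:
    name: source_gztar_zip
    path: source/
- name: Copy sources to GCS bucket   # assumes gcloud auth was set up earlier
  run: gsutil cp -r ./source/* gs://${{ secrets.GCP_BUCKET }}/
```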
Example of GCS output at the end of the workflow:
gs://**/apache-beam-2.23.0.dev0.tar.gz
gs://**/apache-beam-2.23.0.dev0.zip
gs://**/apache_beam-2.23.0.dev0-cp27-cp27m-macosx_10_9_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27m-manylinux1_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27m-manylinux1_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27m-manylinux2010_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27m-manylinux2010_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27mu-manylinux1_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27mu-manylinux1_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27mu-manylinux2010_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp27-cp27mu-manylinux2010_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp35-cp35m-macosx_10_9_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp35-cp35m-manylinux1_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp35-cp35m-manylinux1_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp35-cp35m-manylinux2010_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp35-cp35m-manylinux2010_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp36-cp36m-macosx_10_9_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp36-cp36m-manylinux1_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp36-cp36m-manylinux2010_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp36-cp36m-manylinux2010_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp37-cp37m-macosx_10_9_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp37-cp37m-manylinux1_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp37-cp37m-manylinux1_x86_64.whl
gs://**/apache_beam-2.23.0.dev0-cp37-cp37m-manylinux2010_i686.whl
gs://**/apache_beam-2.23.0.dev0-cp37-cp37m-manylinux2010_x86_64.whl
Very nice.
Two follow-up questions:
- Can we add a very last stage that runs: gsutil ls "gs://*/$GITHUB_REF##//" -> so that we can get the whole output of the GCS folder all at once?
(btw, do we have a mechanism to clean up these gcs locations?)
- What is the difference between "Build python wheels / Build wheels on ..." job executing "Upload wheels" step and "Upload wheels to GCS bucket ..." job?
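The suggested listing stage might look roughly like this (a sketch only; the parameter expansion in the comment above appears garbled, so the `${GITHUB_REF##*/}` form here is my assumption, and the bucket and branch values are placeholders):

```shell
# Placeholder values standing in for the workflow environment.
GCP_BUCKET=beam-wheels-staging
GITHUB_REF=refs/heads/mybranch

# Strip everything up to the last '/' to get the branch name.
BRANCH_NAME="${GITHUB_REF##*/}"

# Print the command a final workflow step could run to list the whole folder.
echo gsutil ls -r "gs://${GCP_BUCKET}/${BRANCH_NAME}/**"
```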
- Currently two steps, `List sources on GCS bucket` and `Copy wheels to GCS bucket`, are listing files of specific types. Instead of these two separate steps I could create a job which will list all files in the specific GCS folder. I think it would be much cleaner and more explicit. Did I understand your idea correctly?
About cleaning up these GCS locations, I consider two options:
- setting lifecycle management on the bucket, which will delete files older than some arbitrary age, e.g. 365 days. The advantage of this is that it will be maintenance-free.
- creating another scheduled workflow on GitHub Actions which will delete GCS folders if the corresponding branch does not exist anymore. It could be scheduled to run e.g. once per week.
Which option makes more sense to you?
- "Upload" steps perform file upload as artifacts so they could be passed between jobs and being available for download for 90 days (if not deleted earlier). These artifacts are picked up later by "Upload to GCS" jobs. What do you think about renaming these steps e.g.: "Upload wheels" -> "Upload wheels as artifacts" ?
Sorry, I missed this earlier.
I think it would be much cleaner and more explicit. Did I understand your idea correctly?
Yes. Your understanding is correct. This sounds good to me.
Clean up:
Both options are good. Can we do both?
What do you think about renaming these steps e.g.: "Upload wheels" -> "Upload wheels as artifacts" ?
Sounds good.
I think it would be much cleaner and more explicit. Did I understand your idea correctly?
Yes. Your understanding is correct. This sounds good to me.
I renamed the steps. Please take a look and check that there are no mistakes.
Clean up:
Both options are good. Can we do both?
Sure. We can do both.
Related to versioning and lifecycle management: I added information in the PR description.
For periodic cleaning tasks I created a separate dependent draft PR: #12049
Could we make it such that:
.github/workflows/build_wheels.yml (outdated)
```yaml
    python-version: 3.7
- name: Get build dependencies
  working-directory: ./sdks/python
  run: python3 -m pip install cython && python3 -m pip install -r build-requirements.txt
```
I think we don't need `cython` at this point, since it's installed and used in the `build_wheels` job. Most probably the same applies to `wheels`.
Good point 👍
I've tested it, and for the `build_source` job only `python3 -m pip install -r build-requirements.txt` is required; `python -m pip install cython` is executed later in `build_wheels`.
@aaltay https://github.com/TobKed/beam/actions?query=branch%3Agithub-actions-build-wheels-linux-macos
Super nice! I have not reviewed the steps; I think the PR is better for doing the initial review. We can do a final quick review of this once we settle on this PR.
cc @potiuk, @brucearctor
Force-pushed from c7d9630 to 32cdf7a
After rethinking the GitHub Actions setup, I made some updates which change the CI behavior depending on the triggering event, as presented below:
Additionally I've added @aaltay answering your questions:
WDYT?
This looks nice. I have a few clarifying questions.
on pull_request: This is good. Would it trigger on every pull request? This may not be needed. I am not sure what GH resources we will have and what kind of queue there will be. We would not want to add test load to all PRs. Questions:
on push:
on schedule: this looks good. Other questions:
+1.
Yes, I think this works. My question was how a release manager could build wheels. The release manager will push some commits, but they will not do this using a PR. Would it still trigger builds? And the release manager might want to build a release at a specific commit in the release branch; could they do this by opening a PR?
@aaltay thank you for the review. Answers to your questions:
I added
GitHub Actions artifacts allow you to persist data in storage and share it with other jobs.
It is done automatically in
I think I made a mistake here. To be precise:
Currently source files are built here: build_release_candidate.sh#L174, based on RELEASE_BRANCH. The release branch is created a few lines above: build_release_candidate.sh#L109. This push will trigger the build of Python sources and wheels. I hope it answers your questions :)
I think your last reply mostly answers my questions. Follow-ups:
Release branch names look like this: release-2.22.0
This action will create the source package and wheels, right? So it would not rely on the sh files to build a tarball and push it?
Based on Apache Airflow cancel workflow https://github.com/apache/airflow/blob/master/.github/workflows/cancel.yml
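For reference, a hedged sketch of the effect this commit achieves. This is not the approach the commit takes (which follows the linked Airflow cancel workflow); it shows GitHub Actions' built-in `concurrency` key, a feature added after this PR, which cancels in-progress runs for the same branch:

```yaml
# Sketch only: the built-in `concurrency` key (added to GitHub Actions after
# this PR) achieves the same effect as a dedicated cancel workflow.
concurrency:
  group: build_wheels-${{ github.ref }}
  cancel-in-progress: true
```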
Force-pushed from 5581b66 to f675d00
@aaltay in relation to the triggering patterns, I think I misunderstood the naming conventions. I will make an update. Thank you for noticing it.
Yes, it will create both. The whole process of building from scratch can be done in the GitHub Actions workflow. I will create another PR dependent on this one, which will use GitHub Actions artifacts and/or GCS in the release sh files.
Thank you @TobKed. I replied to one of your earlier comments. Besides that I think this looks good. Feel free to ping me if I do not respond to your questions within a few days.
…rerun

Previously every build on a given branch would overwrite files:
gs://bucket-name/branch-name/ # first build
gs://bucket-name/branch-name/ # rerun workflow
gs://bucket-name/branch-name/ # new commit

Now they will be differentiated:
gs://bucket-name/branch-name/8121a1...-140818789/ # first build
gs://bucket-name/branch-name/8121a1...-140818789/ # rerun workflow, new file versions
gs://bucket-name/branch-name/2323b0...-140818794/ # new commit

Versions of rerun files can be checked with object versioning (if enabled on the bucket). It will allow better build tracking, e.g. for nightly builds.
Force-pushed from 9559e71 to 6237068
@TobKed - please ping me if this is ready to be merged.
@aaltay I think it is ready, however I have some questions before it can be merged:
WDYT?
I do not have access to the "Settings" tab. Could you work with @tysonjh and infra on this one?
I can give you access to apache-beam-testing project to update things. Or I can do it. Let me know.
Sounds good to me.
@tysonjh could help.
Hi @aaltay and @tysonjh, please take a final look at the code before merging.
🎉
Thank you! Could you check that the new workflows work as expected?
Thanks! Sure, I will monitor whether everything is fine :)
Looks like everything is fine 😀
🎉
Nice, thank you! 🎉
[BEAM-10184] Build python wheels on GitHub Actions for Linux/MacOS (apache#11877)

Individual commit notes:
* Add licence header
* Set environment labels to the 'latest'
* Fix typos
* Use default verbosity for cibuildwheel
* Add Python 3.8 version for wheels
* Simplify steps
* Refactor trigger events
* List files uploaded to GCS in separate job
* Update step naming
* Add nightly master build with automatic tagging
* Cancel running builds on second push to PR (based on the Apache Airflow cancel workflow: https://github.com/apache/airflow/blob/master/.github/workflows/cancel.yml)
* Upload files to GCS only for pushes to release and release candidate branches
* Add pattern to match release candidate branch
* Add paths filter for pull requests
* fixup! Add paths filter for pull requests
* fixup! Cancel running builds on second push to PR
* Fix triggering patterns
* Upload additional files with information about the build
* Change GCS path for uploading artifacts to prevent overwriting during rerun (previously every build on a given branch would overwrite files; now paths include the commit SHA and workflow run ID, and versions of rerun files can be checked with object versioning if enabled on the bucket; this allows better build tracking, e.g. for nightly builds)
* Put bucket name directly into the workflow
Build Python wheels on GitHub Actions for an easier release process in the future.
The GitHub Actions can be previewed in the fork: https://github.com/TobKed/beam/tree/github-actions-build-wheels-linux-macos
Files on GCS Bucket
Pattern
gs://[BUCKET_NAME]/[BRANCH_NAME]/[COMMIT-SHA1]-[WORKFLOW_RUN_ID]/
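A sketch of how this path could be composed inside the workflow from GitHub's built-in variables (all concrete values below are placeholders for illustration, not real builds):

```shell
# Placeholder stand-ins for values GitHub Actions provides at run time.
GCP_BUCKET=beam-wheels-staging
BRANCH_NAME=mybranch
GITHUB_SHA=8121a1b2c3d4e5f60718293a4b5c6d7e8f901234
GITHUB_RUN_ID=140818789

# Compose the destination path following the pattern above.
GCS_PATH="gs://${GCP_BUCKET}/${BRANCH_NAME}/${GITHUB_SHA}-${GITHUB_RUN_ID}/"
echo "$GCS_PATH"
```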
Files
- `*.tar.gz`, `*.zip` - Python source distribution
- `*.whl` - wheels
- `event.json` - GitHub Action event payload
- `.github_action_info` - additional variables not available in `event.json`
Example output on GCS Bucket:
https://github.com/TobKed/beam/runs/793751247?check_suite_focus=true
Before merging
Before merging it is required to set up the related secrets:
- `GCP_PROJECT_ID` - ID of the Google Cloud project, e.g. `apache-beam-testing`
- `GCP_SA_EMAIL` - Service account email address to use for authentication. The service account requires the `Storage Object Admin` role, e.g. `beam-wheels-staging-sa@apache-beam-testing.iam.gserviceaccount.com`
- `GCP_SA_KEY` - The service account key which will be used for authentication. The service account requires the `Storage Object Admin` role. This key should be created and encoded as a Base64 string (e.g. `cat my-key.json | base64` on macOS)
- `GCP_BUCKET` - name of the GCS bucket, e.g. `beam-wheels-staging`
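For example, producing the Base64 value for `GCP_SA_KEY` (a sketch: `my-key.json` and its contents are placeholders for the real downloaded key):

```shell
# Placeholder key file; in practice this is the JSON key downloaded from GCP.
printf '%s' '{"type": "service_account"}' > my-key.json

# Encode it as a single-line Base64 string suitable for a GitHub secret.
base64 < my-key.json | tr -d '\n' > my-key.b64
```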
Staging Bucket settings
For practical reasons the staging bucket should have versioning and lifecycle management enabled.
Object versioning
Versioning will keep previous versions of uploaded files in case of rerunning the workflow.
e.g. paths:
- `gs://bucket-name/branch-name/8121a1...-140818789/` - first build
- `gs://bucket-name/branch-name/8121a1...-140818789/` - rerun workflow, new file versions
- `gs://bucket-name/branch-name/2323b0...-140818794/` - new commit

Object versioning can be enabled with `gsutil versioning set on gs://[BUCKET_NAME]`:
https://cloud.google.com/storage/docs/using-object-versioning
Lifecycle management
Lifecycle management can delete files which are older than a given age, preventing old files from accumulating.
I propose lifecycle settings as follows:
gsutil command:
https://cloud.google.com/storage/docs/managing-lifecycles
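A possible lifecycle configuration, using the 365-day age floated earlier in this thread (the age value and file name are assumptions; the actual settings proposed in the PR description are not captured here). Saved as e.g. `lifecycle.json`, it can be applied with `gsutil lifecycle set lifecycle.json gs://[BUCKET_NAME]`:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```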
JIRA
Subtask of BEAM-9388 - Consider using github actions for building python wheels and more (aka. Transition from Travis)