-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-4032]Download whl for a known target platform on the container image else download sources #16448
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16448 +/- ##
==========================================
+ Coverage 74.60% 83.59% +8.99%
==========================================
Files 653 452 -201
Lines 81691 62093 -19598
==========================================
- Hits 60945 51907 -9038
+ Misses 19767 10186 -9581
+ Partials 979 0 -979
Continue to review full report at Codecov.
|
cc: @tvalentyn. Still verifying if these files match the files that the worker looks for in the requirements cache |
8817eb8
to
84a7241
Compare
Run Python 3.8 Postcommit |
R: @tvalentyn |
"""Downloads SDK package from PyPI and returns path to local path.""" | ||
package_name = Stager.get_sdk_package_name() | ||
package_name = package_name or Stager.get_sdk_package_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover?
Or did you mean to combine _download_pypi_package
and _download_pypi_sdk_package
into the same method (as in, you would pass package_name = Stager.get_sdk_package_name()
into _download_pypi_package
to download the SDK)? That would make sense, as it's mostly the same code. We can remove the warning specific to the apache beam SDK package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I wanted to combine them as they work in similar way. I made a small change to this. This function would fetch apache beam sdk (source or whl) by passing arguments to the _download_pypi_package
. I didn't combine them for now but will make a note of it
] | ||
|
||
if fetch_binary: | ||
cmd_args.pop() # remove the no binary flag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make it part of the condition:
if fetch_binary:
# add binary flags:
else:
# add no-binary flags
# run the command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# downloads the binary distributions | ||
( | ||
populate_requirements_cache if populate_requirements_cache else | ||
Stager._populate_requirements_cache_with_whl)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep one method: _populate_requirements_cache, and add add a flag such as prefer_bdists=True/False
(or fetch_binary
to match other flags) to avoid code duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Stager._populate_requirements_cache)( | ||
setup_options.requirements_file, requirements_cache_path) | ||
if setup_options.view_as(WorkerOptions).sdk_container_image is None: | ||
# downloads the binary distributions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user passes a custom image we should advise against passing --requirements_file
in a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a warning. I will add the link to the website on how to use custom containers with requirements once that website page is ready.
Run Python 3.8 Postcommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, let's share a one-pager on dev@ with summary of the problem, possible solutions and trade-offs calling out the behavior change for users, indicate which solution you propose but also leave room for feedback & further input.
@@ -197,6 +198,9 @@ def create_job_resources(options, # type: PipelineOptions | |||
resources = [] # type: List[beam_runner_api_pb2.ArtifactInformation] | |||
|
|||
setup_options = options.view_as(SetupOptions) | |||
# True when sdk_container_image is apache beam image | |||
fetch_binary = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code might be a little more self-explanatory with:
fetch_binary = ( | |
has_default_container_image = ( |
then, we can use below: fetch_binary=has_default_container_image
when passing into a function.
if not fetch_binary: # display warning. | ||
_LOGGER.warning( | ||
'Avoid using requirements.txt when using a ' | ||
'custom container image.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wording suggestion for the message:
When using a custom container image, prefer installing additional PyPI dependencies directly into the image, instead of specifying them via runtime options, such as --requirements_file.
@@ -221,10 +225,16 @@ def create_job_resources(options, # type: PipelineOptions | |||
setup_options.requirements_file, REQUIREMENTS_FILE)) | |||
# Populate cache with packages from the requirement file option and | |||
# stage the files in the cache. | |||
if not fetch_binary: # display warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the inline comment is not necessary, the comment is self-explanatory in this case. Btw, https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments/ is a good read. Rule 1 applies here.
@@ -162,7 +163,7 @@ def create_job_resources(options, # type: PipelineOptions | |||
temp_dir, # type: str | |||
build_setup_args=None, # type: Optional[List[str]] | |||
pypi_requirements=None, # type: Optional[List[str]] | |||
populate_requirements_cache=None, # type: Optional[str] | |||
populate_requirements_cache=None, # type: Optional[Callable] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Let's include argument types. I think it's:
populate_requirements_cache=None, # type: Optional[Callable] | |
populate_requirements_cache=None, # type: Optional[Callable[[str, str], None]] |
abi_tag = 'cp%d%d%s' % ( | ||
sys.version_info[0], sys.version_info[1], abi_suffix | ||
) # ABI tag to use | ||
# install each package individually |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add explanation why installing each package individually is necessary in this case.
@@ -824,32 +890,17 @@ def _download_pypi_sdk_package( | |||
raise RuntimeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the arguments that can be inferred (language_version, impl tag, abi tag, platform) from the function signature
language_version_tag='27', | ||
language_implementation_tag='cp', | ||
abi_tag='cp27mu', | ||
platform_tag='manylinux2014_x86_64', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's create a helper function that returns target platform, and check pip version in that helper. For pip >= 19.3 let's use manylinux2014_x86_64 wheel, otherwise use manylinux2010_x86_64 wheel and add a TODO: when pypa/pip#10760 is addressed, download the wheel based on glibc version in Beam's Python SDK base image.
language_implementation_tag='cp', | ||
abi_tag='cp27mu', | ||
platform_tag='manylinux2014_x86_64', | ||
package_name=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional) It may make sense to move this param earlier in the argument list
# because we stage the SDK separately. | ||
def _populate_requirements_cache(requirements_file, # type: str | ||
cache_dir, # type: str | ||
fetch_binary=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce code complexity, I think we can omit fetch_binary
param from this flag. Currently, the only case when fetch_binary=False
branch is used is custom container users who pass the requirements file. I think it's ok to change the behavior for these users as well, so that we download sources one-by-one from the requirements.txt file in this case, as opposed to download sources from the entire requirements file
for requirement in requirements_to_install: | ||
try: | ||
# download the bdist compatible with the debian platform | ||
# TODO(anandinguva): the platform tag will get updated. Check PEP 600 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From offline discussion:
- We can rewrite TODO to make it more actionable (do X why Y is fixed)
- _download_pypi_package currently does not do the download, but builds the command line
- We can likely use the same method for _download_pypi_package and _download_pypi_sdk_package if the download params are the same.
@AnandInguva - what is the next step on this PR? |
We had few offline discussions on this approach and encountered few blockers from PyPI to implement efficient functionality. All the discussion can be found here. A different solution for this would be implemented this week. I will change this to draft PR to preserve the comments that can be used in another PR with discussed solution. |
Closing this wrt #16633 |
If a user uses default container image, the target platform is debian. Here we can download whls instead of sources which would decrease startup time of the workers.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.