Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-4032]Download whl for a known target platform on the container image else download sources #16448

Closed
wants to merge 6 commits into from

Conversation

AnandInguva
Copy link
Contributor

If a user uses default container image, the target platform is debian. Here we can download whls instead of sources which would decrease startup time of the workers.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status Build Status Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status ---
XLang Build Status Build Status Build Status Build Status Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Jan 6, 2022

Codecov Report

Merging #16448 (468b1a4) into master (928da54) will increase coverage by 8.99%.
The diff coverage is 59.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16448      +/-   ##
==========================================
+ Coverage   74.60%   83.59%   +8.99%     
==========================================
  Files         653      452     -201     
  Lines       81691    62093   -19598     
==========================================
- Hits        60945    51907    -9038     
+ Misses      19767    10186    -9581     
+ Partials      979        0     -979     
Impacted Files Coverage Δ
...s/python/apache_beam/runners/portability/stager.py 85.17% <59.45%> (-3.12%) ⬇️
...pache_beam/runners/portability/portable_metrics.py 89.65% <0.00%> (-10.35%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 87.50% <0.00%> (-1.71%) ⬇️
sdks/python/apache_beam/dataframe/partitionings.py 94.53% <0.00%> (-1.51%) ⬇️
sdks/python/apache_beam/internal/metrics/metric.py 90.00% <0.00%> (-1.00%) ⬇️
...pache_beam/runners/interactive/interactive_beam.py 75.60% <0.00%> (-0.98%) ⬇️
...runners/interactive/display/pcoll_visualization.py 85.85% <0.00%> (-0.51%) ⬇️
...s/python/apache_beam/io/gcp/bigquery_file_loads.py 87.70% <0.00%> (-0.03%) ⬇️
sdks/python/apache_beam/metrics/metric.py 95.38% <0.00%> (ø)
sdks/python/apache_beam/runners/common.py 89.98% <0.00%> (ø)
... and 217 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 928da54...468b1a4. Read the comment docs.

@AnandInguva
Copy link
Contributor Author

cc: @tvalentyn.

Still verifying if these files match the files that the worker looks for in the requirements cache

@AnandInguva AnandInguva marked this pull request as ready for review January 10, 2022 19:37
@AnandInguva AnandInguva changed the title [BEAM-4032]Download whl if the container image is the default container image. [BEAM-4032]Download whl for a known target platform on the container image else download sources Jan 10, 2022
@AnandInguva
Copy link
Contributor Author

Run Python 3.8 Postcommit

@AnandInguva
Copy link
Contributor Author

AnandInguva commented Jan 10, 2022

R: @tvalentyn
cc: @pabloem

sdks/python/apache_beam/runners/portability/stager.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/runners/portability/stager.py Outdated Show resolved Hide resolved
"""Downloads SDK package from PyPI and returns path to local path."""
package_name = Stager.get_sdk_package_name()
package_name = package_name or Stager.get_sdk_package_name()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

Or did you mean to combine _download_pypi_package and _download_pypi_sdk_package into the same method (as in, you would pass package_name = Stager.get_sdk_package_name() into _download_pypi_package to download the SDK)? That would make sense, as it's mostly the same code. We can remove the warning specific to the apache beam SDK package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wanted to combine them as they work in similar way. I made a small change to this. This function would fetch apache beam sdk (source or whl) by passing arguments to the _download_pypi_package. I didn't combine them for now but will make a note of it

]

if fetch_binary:
cmd_args.pop() # remove the no binary flag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it part of the condition:

if fetch_binary:
  # add binary flags:
else:
  # add no-binary flags
# run the command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# downloads the binary distributions
(
populate_requirements_cache if populate_requirements_cache else
Stager._populate_requirements_cache_with_whl)(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep one method: _populate_requirements_cache, and add add a flag such as prefer_bdists=True/False (or fetch_binary to match other flags) to avoid code duplication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Stager._populate_requirements_cache)(
setup_options.requirements_file, requirements_cache_path)
if setup_options.view_as(WorkerOptions).sdk_container_image is None:
# downloads the binary distributions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user passes a custom image we should advise against passing --requirements_file in a warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a warning. I will add the link to the website on how to use custom containers with requirements once that website page is ready.

@AnandInguva
Copy link
Contributor Author

Run Python 3.8 Postcommit

Copy link
Contributor

@tvalentyn tvalentyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's share a one-pager on dev@ with summary of the problem, possible solutions and trade-offs calling out the behavior change for users, indicate which solution you propose but also leave room for feedback & further input.

@@ -197,6 +198,9 @@ def create_job_resources(options, # type: PipelineOptions
resources = [] # type: List[beam_runner_api_pb2.ArtifactInformation]

setup_options = options.view_as(SetupOptions)
# True when sdk_container_image is apache beam image
fetch_binary = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code might be a little more self-explanatory with:

Suggested change
fetch_binary = (
has_default_container_image = (

then, we can use below: fetch_binary=has_default_container_image when passing into a function.

if not fetch_binary: # display warning.
_LOGGER.warning(
'Avoid using requirements.txt when using a '
'custom container image.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording suggestion for the message:

When using a custom container image, prefer installing additional PyPI dependencies directly into the image, instead of specifying them via runtime options, such as --requirements_file.

@@ -221,10 +225,16 @@ def create_job_resources(options, # type: PipelineOptions
setup_options.requirements_file, REQUIREMENTS_FILE))
# Populate cache with packages from the requirement file option and
# stage the files in the cache.
if not fetch_binary: # display warning.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the inline comment is not necessary, the comment is self-explanatory in this case. Btw, https://stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments/ is a good read. Rule 1 applies here.

@@ -162,7 +163,7 @@ def create_job_resources(options, # type: PipelineOptions
temp_dir, # type: str
build_setup_args=None, # type: Optional[List[str]]
pypi_requirements=None, # type: Optional[List[str]]
populate_requirements_cache=None, # type: Optional[str]
populate_requirements_cache=None, # type: Optional[Callable]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Let's include argument types. I think it's:

Suggested change
populate_requirements_cache=None, # type: Optional[Callable]
populate_requirements_cache=None, # type: Optional[Callable[[str, str], None]]

abi_tag = 'cp%d%d%s' % (
sys.version_info[0], sys.version_info[1], abi_suffix
) # ABI tag to use
# install each package individually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add explanation why installing each package individually is necessary in this case.

@@ -824,32 +890,17 @@ def _download_pypi_sdk_package(
raise RuntimeError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the arguments that can be inferred (language_version, impl tag, abi tag, platform) from the function signature

language_version_tag='27',
language_implementation_tag='cp',
abi_tag='cp27mu',
platform_tag='manylinux2014_x86_64',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's create a helper function that returns target platform, and check pip version in that helper. For pip >= 19.3 let's use manylinux2014_x86_64 wheel, otherwise use manylinux2010_x86_64 wheel and add a TODO: when pypa/pip#10760 is addressed, download the wheel based on glibc version in Beam's Python SDK base image.

language_implementation_tag='cp',
abi_tag='cp27mu',
platform_tag='manylinux2014_x86_64',
package_name=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) It may make sense to move this param earlier in the argument list

# because we stage the SDK separately.
def _populate_requirements_cache(requirements_file, # type: str
cache_dir, # type: str
fetch_binary=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce code complexity, I think we can omit fetch_binary param from this flag. Currently, the only case when fetch_binary=False branch is used is custom container users who pass the requirements file. I think it's ok to change the behavior for these users as well, so that we download sources one-by-one from the requirements.txt file in this case, as opposed to download sources from the entire requirements file

for requirement in requirements_to_install:
try:
# download the bdist compatible with the debian platform
# TODO(anandinguva): the platform tag will get updated. Check PEP 600
Copy link
Contributor

@tvalentyn tvalentyn Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From offline discussion:

  • We can rewrite TODO to make it more actionable (do X why Y is fixed)
  • _download_pypi_package currently does not do the download, but builds the command line
  • We can likely use the same method for _download_pypi_package and _download_pypi_sdk_package if the download params are the same.

@aaltay
Copy link
Member

aaltay commented Jan 27, 2022

@AnandInguva - what is the next step on this PR?

@AnandInguva
Copy link
Contributor Author

AnandInguva commented Jan 27, 2022

@aaltay

@AnandInguva - what is the next step on this PR?

We had few offline discussions on this approach and encountered few blockers from PyPI to implement efficient functionality. All the discussion can be found here. A different solution for this would be implemented this week.

I will change this to draft PR to preserve the comments that can be used in another PR with discussed solution.

@AnandInguva AnandInguva marked this pull request as draft January 27, 2022 05:16
@AnandInguva
Copy link
Contributor Author

Closing this wrt #16633

@AnandInguva AnandInguva closed this Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants