
[Bug]: Python SDK version 2.51 - extra_package option in PipelineOptions not propagating to workers #29037

Closed
kamalmemon opened this issue Oct 17, 2023 · 7 comments
Labels: dataflow, done & done (Issue has been reviewed after it was closed for verification, followups, etc.), P2, python

@kamalmemon

What happened?

SDK Version with issue: 2.51
SDK Version without issue: 2.50

I've encountered an issue with the extra_package option in PipelineOptions when submitting a job using the DataflowRunner. The dependencies specified in the extra_package option are not being installed on the Dataflow workers. This behavior is specific to SDK version 2.51. The same code works as expected when using SDK version 2.50.

Details

Example code:

import apache_beam as beam
from apache_beam.io import ReadFromBigQuery
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

job_options = {
    'job_name': "<REDACTED>",
    'project': "<REDACTED>",
    'runner': "DataflowRunner",
    'temp_location': "<REDACTED>",
    'region': "us-central1",
    'subnetwork': "<REDACTED>",
    'extra_package': 'tensorflow==2.13',
}

def run(output_path, query, project, feature_dtype_map, options):

    opts = PipelineOptions(**options)
    google_cloud_options = opts.view_as(GoogleCloudOptions)
    google_cloud_options.labels = ['<REDACTED>']

    with beam.Pipeline(options=opts) as p:
        (
            p | 'ReadFromBigQuery' >> beam.io.Read(ReadFromBigQuery(query=query, use_standard_sql=True))
            | 'ConvertToTFExample' >> beam.Map(<REDACTED>)
            | 'WriteToTFRecord' >> <REDACTED>
        )

Observed Behavior
When running with SDK version 2.51, the Dataflow workers fail with a ModuleNotFoundError for tensorflow, suggesting that the tensorflow==2.13 package was not installed on the workers. No such error is observed with SDK version 2.50.

Expected Behavior
The tensorflow==2.13 package should be installed on the Dataflow workers, and the job should proceed without errors related to missing modules.

Would appreciate any assistance or workaround for this issue. Thank you!

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • [x] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [x] Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor

Thanks for reporting the issue. I am taking a look now.

@AnandInguva
Contributor

job_options = {
    'job_name': "<REDACTED>",
    'project': "<REDACTED>",
    'runner': "DataflowRunner",
    'temp_location': "<REDACTED>",
    'region': "us-central1",
    'subnetwork': "<REDACTED>",
    'extra_package': 'tensorflow==2.13',
}

Are you using a zip or tarball for the extra_package? I see you have passed a string, which should raise a RuntimeError.

@AnandInguva
Contributor

AnandInguva commented Oct 17, 2023

One difference between 2.50.0 and 2.51.0 is that for 2.51.0, tensorflow has been removed from the Apache Beam docker containers at #28424.

I have created a simple setup.py and ran python setup.py sdist (or pip install build && python -m build) from the directory where setup.py is located. This creates a tarball, which I pass as --extra_package=<path to tarball> to a simple Beam pipeline that uses tensorflow. The pipeline works as expected on both 2.50.0 and 2.51.0.

from setuptools import setup

setup(
    name="custom-setup",
    version="0.1",
    packages=[],  # specify your packages here if you have any
    install_requires=[
        'tensorflow'
    ]
)
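
For reference, the end-to-end flow might look roughly like the sketch below (untested; the dist/ filename follows from the name and version in the setup.py above and may vary, and the redacted values are placeholders):

# Build the source distribution first, from the directory containing setup.py:
#   python setup.py sdist
#   # or: pip install build && python -m build
# This should leave a tarball under dist/, e.g. dist/custom-setup-0.1.tar.gz.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='<REDACTED>',
    temp_location='<REDACTED>',
    region='us-central1',
)
# Equivalent to passing --extra_package=dist/custom-setup-0.1.tar.gz on the
# command line; the tarball is staged and pip-installed on each worker.
options.view_as(SetupOptions).extra_packages = ['dist/custom-setup-0.1.tar.gz']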

Can you share the script you are using to create the tarball/zip for the --extra_package pipeline option?

@AnandInguva AnandInguva added the P2 label and removed the P1 and bug labels Oct 17, 2023
@kamalmemon
Author

kamalmemon commented Oct 17, 2023

Hi,

Thanks for getting back quickly. I might have tagged the issue incorrectly, so sorry about that 😬.

To address your question on the tarball/zip for the --extra_package option: I've been simply passing the string 'tensorflow==2.13', not a package path or tarball. Surprisingly, this worked in SDK 2.50.0, but with 2.51.0 I'm running into the described issue. But as you mentioned, the removal of TensorFlow from the Apache Beam docker containers in the new version explains the cause.

However, what's really puzzling is that using the string directly in the --extra_package arg didn't give any runtime errors in either version 2.50.0 or 2.51.0. This seems a bit off, especially considering the documentation and your feedback.

@AnandInguva
Contributor

AnandInguva commented Oct 17, 2023

Okay, I have tested your code. When you pass extra_package in a dict like you did, the Dataflow runner launches the job with extra_package=tensorflow, but this gets ignored on the Dataflow worker since it expects a tarball.

Your code worked on 2.50.0 because, even though you were passing a module name (tensorflow) to extra_package, tensorflow was already installed in the Apache Beam containers. For 2.51.0, tensorflow is not installed, hence you get the ImportError.

Passing a string (for example, a module name) to --extra_package is a bug, and we can solve this in 2.52.0. If you pass the extra_package as options.view_as(SetupOptions).extra_packages = tensorflow, it will throw a RuntimeError, but the pattern you mentioned above doesn't.

To address your concern, you can use the requirements_file pipeline option and pass a requirements.txt with tensorflow in it.
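
A minimal sketch of that workaround (assuming a local requirements.txt that lists tensorflow==2.13; the redacted values are placeholders):

# requirements.txt contains, for example, a single line:
#   tensorflow==2.13
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='<REDACTED>',
    temp_location='<REDACTED>',
    region='us-central1',
)
# Equivalent to --requirements_file=requirements.txt; the listed packages are
# staged and installed on the Dataflow workers at startup.
options.view_as(SetupOptions).requirements_file = 'requirements.txt'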

@kamalmemon
Author

Thanks for the explanation and for testing my code. After revisiting the documentation, I understand now that using the requirements_file option is indeed the more appropriate method. Ultimately, though, I'm considering a custom container image for my Beam pipeline to ensure all dependencies are streamlined.

And it's great that we identified a bug through this. Feel free to manage this issue however you see fit, whether keeping it open or referencing it elsewhere.

Appreciate your assistance!
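
For the custom-image route mentioned above, a rough sketch (assuming an image built from an Apache Beam SDK base image with tensorflow preinstalled; the image URI is a hypothetical placeholder):

# The image would be built separately, e.g. from a Dockerfile based on
# apache/beam_python3.10_sdk:2.51.0 that runs "pip install tensorflow==2.13",
# and pushed to a registry the Dataflow workers can pull from.
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='<REDACTED>',
    temp_location='<REDACTED>',
    region='us-central1',
)
options.view_as(WorkerOptions).sdk_container_image = (
    'us-central1-docker.pkg.dev/<REDACTED>/beam/custom-sdk:latest')  # placeholder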

@AnandInguva
Contributor

One more thing before closing: with the way you are passing pipeline options, the values for those options are lazily updated. When the parser parses the command-line args, it maps --extra_package to --extra_packages. So when you provide pipeline option values lazily like this, you should pass extra_packages instead of extra_package, since the dest is extra_packages.
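
In other words, a hedged sketch of how the original job_options dict would look with that fix (the tarball path is a placeholder for a real package file, such as the sdist built earlier):

from apache_beam.options.pipeline_options import PipelineOptions

job_options = {
    'job_name': '<REDACTED>',
    'project': '<REDACTED>',
    'runner': 'DataflowRunner',
    'temp_location': '<REDACTED>',
    'region': 'us-central1',
    'subnetwork': '<REDACTED>',
    # Use the destination name ("extra_packages", plural) with a list of
    # package paths, not the flag name "extra_package" with a module name.
    'extra_packages': ['dist/custom-setup-0.1.tar.gz'],  # placeholder path
}
opts = PipelineOptions(**job_options)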

@github-actions github-actions bot added this to the 2.52.0 Release milestone Oct 17, 2023
@AnandInguva AnandInguva self-assigned this Oct 18, 2023
@damccorm damccorm added the "done & done" (Issue has been reviewed after it was closed for verification, followups, etc.) label Nov 20, 2023