[Bug]: Python SDK version 2.51 - extra_package option in PipelineOptions not propagating to workers #29037
Comments
Thanks for reporting the issue. I am taking a look now.
Are you using a zip or tarball for the extra_package? I see you have passed a string, which should raise a RuntimeError.
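The point above, that `extra_package` expects a package archive rather than a requirement string, can be checked before a job is submitted. The following is a minimal sketch of such a pre-flight check. The helper name `check_extra_package` and the suffix list are assumptions made for illustration; this is not Beam's own validation code.

```python
import os

# Assumption: file suffixes treated here as valid local package archives.
ALLOWED_SUFFIXES = (".tar", ".tar.gz", ".whl", ".zip")


def check_extra_package(value: str) -> str:
    """Return `value` unchanged if it looks like a package archive path.

    Hypothetical helper for illustration; not part of Apache Beam.
    """
    name = os.path.basename(value)
    if not name.endswith(ALLOWED_SUFFIXES):
        raise RuntimeError(
            "--extra_package expects a path ending in one of "
            f"{ALLOWED_SUFFIXES}, got {value!r}"
        )
    return value
```

With this check, a value such as `tensorflow==2.13` is rejected up front, while a path such as `dist/my_pkg-0.1.tar.gz` passes through.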
One difference between 2.50.0 and 2.51.0 is that in 2.51.0, tensorflow was removed from the Apache Beam Docker containers in #28424. I have created a simple
Can you share the script you are using to create the tarball/zip for the
Hi, thanks for getting back quickly. I might have tagged the issue incorrectly, so sorry about that 😬. To address your question on the tarball/zip for the However, what's really puzzling is that using the string directly in the --extra_package arg didn't give any runtime errors in either version.
Okay. I have tested your code. When you pass Your code worked on 2.50.0 because even though you were passing a module name (tensorflow) to Passing a string (for example, a module name) to To solve your concern, you can use
Thanks for the explanation and for testing my code. After revisiting the documentation, I understand now that using the requirements_file is indeed the more appropriate method. Ultimately, though, I'm considering a custom image with my Beam pipeline to ensure all dependencies are streamlined. And it's great that we identified a bug through this. Feel free to manage this issue how you see fit, whether keeping it open or referencing it elsewhere. Appreciate your assistance!
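As a rough illustration of the requirements_file approach mentioned above, the sketch below writes the pinned dependency to a file and builds the matching `--requirements_file` flag. The helper names `write_requirements` and `dependency_flags` are hypothetical, and the final lines assume `apache-beam` is installed; other Dataflow flags (project, region, temp location) are omitted and would be needed for a real job.

```python
from pathlib import Path


def write_requirements(path, pins):
    """Write one pinned requirement per line and return the path as a string."""
    target = Path(path)
    target.write_text("\n".join(pins) + "\n")
    return str(target)


def dependency_flags(requirements_path):
    """Build the command-line flag that points Beam at a requirements file."""
    return [f"--requirements_file={requirements_path}"]


if __name__ == "__main__":
    req = write_requirements("requirements.txt", ["tensorflow==2.13"])
    flags = ["--runner=DataflowRunner"] + dependency_flags(req)

    # Requires apache-beam; imported here so the helpers above work without it.
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions(flags)
    print(options.view_as(SetupOptions).requirements_file)
```

Keeping the pins in a requirements file means the workers install them with pip at startup, whereas `extra_package` is meant for local archives that are staged as files.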
One more thing before closing - the way you are passing pipeline options, the values for those options are lazily updated. When the parser parses the command line args, it maps
What happened?
SDK Version with issue: 2.51
SDK Version without issue: 2.50
I've encountered an issue with the extra_package option in PipelineOptions when submitting a job using the DataflowRunner. The dependencies specified in the extra_package option are not being installed on the Dataflow workers. This behavior is specific to SDK version 2.51. The same code works as expected when using SDK version 2.50.
Details
Example code:
Observed Behavior
When running with SDK version 2.51, the Dataflow workers fail with a ModuleNotFoundError for tensorflow, suggesting that the tensorflow==2.13 package was not installed on the workers. No such error is observed with SDK version 2.50.
Expected Behavior
The tensorflow==2.13 package should be installed on the Dataflow workers, and the job should proceed without errors related to missing modules.
Would appreciate any assistance or workaround for this issue. Thank you!
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components