[Feature Request]: Make Python container venv location configurable #29663
Comments
Given that a venv in the semi-persistent dir doesn't work for Dataflow, we could detect whether Dataflow is used as a special case and, if so, set RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT. The extra venv wrapping doesn't make much sense for Dataflow.
Spark local dir is pretty much your writable temporary scratch space; that's where all the temporary output files and shuffle spills land. If you run Spark in a container, it's usually a semi-persistent storage mount, such as a Kubernetes emptyDir. I think Flink has something similar.
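To illustrate how a job could look this location up itself (a minimal PySpark sketch, assuming an active SparkSession; this is not something Beam does today):

```python
from pyspark.sql import SparkSession

# Read the configured Spark scratch directory, falling back to /tmp
# when spark.local.dir has not been set explicitly.
spark = SparkSession.builder.getOrCreate()
local_dir = spark.conf.get("spark.local.dir", "/tmp")
print(f"Spark scratch space: {local_dir}")
```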
It sounds reasonable to use that location.
Note that the root cause of Dataflow's issue was that the semi-persistent dir was backed by the /var directory, which is often mounted noexec. Per https://stackoverflow.com/questions/46525081/does-kubernetes-mount-an-emtpydir-volume-on-the-host , emptyDir also might live in /var. I wonder if you might run into the same friction.
Inside the container, it's usually a writable mount. You cannot set mount flags for emptyDirs, so, according to the docs, there is no way to control noexec for them.
That's what I am worried about. It might depend on the host OS. Dataflow uses Container-Optimized OS, which had this issue back when I last looked at it.
There is probably no solution that works everywhere. You also cannot expect /opt/apache to be writable on anything but the standard SDK container. |
This is not a Dataflow concern; I am just wondering whether you are going to hit the same problem on your Spark cluster as we did with Dataflow when we created the venv in the semi-persistent dir.
I suppose you could check whether this is a noexec partition in your deployment.
I know for sure it doesn't. The question is whether this is the default for most other deployments out there. |
Is it possible to check whether a folder has executable permission from inside the Docker container, or can this only be checked on the host VM?
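One way to probe this from inside the container itself is to drop a tiny script into the directory and try to execute it. A minimal Python sketch (not Beam code; the probe path is just an example):

```python
#!/usr/bin/env python3
"""Probe whether a directory allows executing files (i.e. is not mounted noexec)."""
import os
import stat
import subprocess
import tempfile


def dir_allows_exec(path: str) -> bool:
    # Create a trivial shell script in the target directory and try to run it.
    # On a noexec mount, exec fails with a permission error even though the
    # file itself has its executable bit set.
    with tempfile.NamedTemporaryFile(dir=path, suffix=".sh", delete=False) as f:
        f.write(b"#!/bin/sh\nexit 0\n")
        probe = f.name
    try:
        os.chmod(probe, os.stat(probe).st_mode | stat.S_IXUSR)
        return subprocess.run([probe]).returncode == 0
    except OSError:  # PermissionError on noexec mounts
        return False
    finally:
        os.unlink(probe)


if __name__ == "__main__":
    print(dir_allows_exec("/opt/apache/beam-venv"))  # example probe path
```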
What would you like to happen?
In #16658, I added a feature where the Python SDK harness `boot` binary installs all application dependencies into a temporary venv to make containers reusable. At the moment, this location is hard-coded to `/opt/apache/beam-venv`, and the boot binary falls back to the default environment if that fails or if `RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT` is set. Now my own requirements have shifted and I would like to make this path configurable.

In a first iteration of my PR back then, I used the value of the `-semiPersistDir` flag, which defaults to `/tmp`. Unfortunately, this caused some Dataflow tests to fail (iirc), so we decided to use `/opt/apache/beam-venv`, which is writable by the beam user in the upstream Python SDK container but may not be in other environments.

I think using `/tmp` would make sense, but making it configurable would be better. Even better would be if the runner implementation could set this automatically; for instance, the Spark runner would set it to `spark.local.dir`.

I could submit a PR to make this configurable with a flag (either `-semiPersistDir` or an additional flag if that still doesn't work). I don't know, however, how to get this value automatically from the job runner. Perhaps someone else has an idea?
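Purely for illustration (the actual entrypoint is the Go `boot` binary, and the flag and parameter names below are hypothetical), the resolution order I have in mind would be roughly:

```python
import os
from typing import Optional

# Hypothetical sketch of the proposed venv-location resolution order; the
# real logic lives in the Go boot binary and none of these names are final.
DEFAULT_VENV_DIR = "/opt/apache/beam-venv"


def resolve_venv_dir(venv_dir_flag: Optional[str],
                     semi_persist_dir: Optional[str]) -> Optional[str]:
    if os.environ.get("RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT"):
        return None  # skip the extra venv entirely and use the default environment
    if venv_dir_flag:
        return venv_dir_flag  # an explicit new flag would win
    if semi_persist_dir:
        return semi_persist_dir  # e.g. set by the runner to spark.local.dir
    return DEFAULT_VENV_DIR  # current hard-coded behavior


print(resolve_venv_dir(None, "/tmp"))  # -> /tmp
```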
cc @tvalentyn
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components