-
Notifications
You must be signed in to change notification settings - Fork 541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for envsubst in extra_pip_args #1673
Conversation
Since MODULE.bazel files don't support load(), it is more difficult to grab the value of an environment variable such as PIP_INDEX_URL or PIP_RETRIES and feed it to pip.parse(). Some of these PIP_* variables are useful in extra_pip_args. To enable this use case, also for the bzlmod scenario, add the "envsubst" attribute, which is a list of environment variables to substitute in extra_pip_args. If $VARNAME, ${VARNAME} or ${VARNAME:-default} is found in extra_pip_args, substitute the value of the variable.
python/private/bzlmod/pip.bzl
Outdated
A list of environment variables to substitute (e.g. `["PIP_INDEX_URL", | ||
"PIP_RETRIES"]`). The corresponding variables are expanded in `extra_pip_args` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vonschultz, could you elaborate a little on what is the usecase here? Is it only for the requirements files that your project owns? What should we do for the PIP_INDEX_URL
in a non-root module? On one hand I think that the root module should be able to override the index url + extra index urls for everyone, but if that is what we want to achieve with this PR, then maybe having support for this in:
pip.override(
index_url = "foo",
extra_index_urls = ["bar", "baz"],
)
might be maybe cleaner?
We could append the index_url
and prepend the extra_index_urls
to the extra pip args and those would take precedence (I think).
Are there other env var names that you want to expand here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The task I need it for is converting a project from WORKSPACE.bazel
to MODULE.bazel
in a case where the WORKSPACE.bazel
uses environment variables (by writing them to a file using a repo rule and loading that file). Since loading isn't supported in MODULE.bazel
I need the support for environment variables here instead. I have only been thinking in terms of the root module, so I'm not entirely sure what the desired behavior would be in a non-root module.
In principle, there are as many potential use cases as there are flags to pip
, but to focus on the ones explicitly mentioned in the patch:
PIP_RETRIES
, and alsoPIP_TIMEOUT
for that matter, are useful when the user has an unstable Internet connection. Rather like Bazel's own --http_timeout_scaling, which a user might put in their~/.bazelrc
, it is a property of the environment we run in, rather than the repo itself.PIP_INDEX_URL
is given to us by our CI system, and is pointing to an AWS CodeArtifact URL. ThePIP_INDEX_URL
contains a short-lived token for authentication, so we need to read it from the environment rather than hardcoding it in theMODULE.bazel
file.
I haven't been thinking of pip.override
. Isn't that specific for a given Python package (distribution)?
https://rules-python.readthedocs.io/en/latest/api/extensions/pip.html#pip-override
It doesn't sound like it's something you'd use to override the index url.
I'm not convinced overriding the index url for everyone (including non-root modules) is the desired behavior. If someone uses the default https://pypi.org/simple certainly, we'd like to be able to redirect that, but if a non-root module very explicitly and purposefully uses a different index url containing a small set of packages not found on the public PyPI, then it's not a given that we should override that.
(Incidentally, I think extra_index_urls
is a huge security issue, and should never be used. That's how we got hit by https://pytorch.org/blog/compromised-nightly-dependency/, and I'm sure there are many other cases like it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the environment variables I mentioned, I can also imagine PIP_PROXY
being useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for elaborating on the extra index url usage.
I am wondering if the right thing here would be to have an attr
for each thing pip
supports - the downside would be that we would need to have stuff in sync with the pip
version we support/use, but the upside would be that it would be more structured and we could more easily ensure that sensitive info does not leak into the lock file.
What do you think? How many extra attributes would you need? @rickeylev, do you have any thoughts on this?
python/private/bzlmod/pip.bzl
Outdated
A list of environment variables to substitute (e.g. `["PIP_INDEX_URL", | ||
"PIP_RETRIES"]`). The corresponding variables are expanded in `extra_pip_args` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for elaborating on the extra index url usage.
I am wondering if the right thing here would be to have an attr
for each thing pip
supports - the downside would be that we would need to have stuff in sync with the pip
version we support/use, but the upside would be that it would be more structured and we could more easily ensure that sensitive info does not leak into the lock file.
What do you think? How many extra attributes would you need? @rickeylev, do you have any thoughts on this?
Instead of implementing envsubst support directly in the pip bzlmod extension, move it into the affected rules (whl_library and pip_repository). This way, the functionality will also benefit people who don't use bzlmod yet, and it may reduce the risk of leaking sensitive info into the MODULE.bazel.lock file.
Retried the build and it was fine. There are feature requests to use the bazel downloader to download wheels/sdists from PyPI instead of using PyPI and I think |
Move envsubst to its own file to facilitate testing, and add some unit tests in tests/private/envsubst. Since tests revealed that nesting silently produces unexpected results, add a check to limit nesting of environment variables in the default of a different environment variable expansion.
We've added envsubst to pip_parse and pip_repository.
In //python/pip_install:pip_repository_bzl, add a dependency to a new //python/private:envsubst_bzl library, which has the envsubst.bzl file.
I got a new error on Windows now, which at a glance seems unrelated to my patches:
Looks like I don't have the required access level to retry the job, however, so I'm afraid I'll have to ask you to take a look at it, if you don't mind: https://buildkite.com/bazel/rules-python-python/builds/6937#018d1cf7-6942-4148-af4c-2747940a0be6 |
Yeah, that's a flake. I pressed retry. Few thoughts. re: extra index url being a security issue: I don't think it's any worse than a custom repository rule? An arbitrary module might implement a custom repo rule to download arbitrary things or execute arbitrary things and include it into a program that gets run at build time. extra index url isn't any worse. It's still scoped to their pip.parse/targets. re: environment variables: Should we add some PIP_* names to the repo rule's/module extensions environment variable settings? Or perhaps Right now, any changes to the environment won't be reflected in subsequent runs -- you'll have to manually sync/clean (and thus invalidate everything) to pick up any envvar changes.
I think the thing you have to do is create a repo rule to read the env var and pass it as a file into pip.parse. Something like:
Unfortunately, a repository rule can't have a dynamic set of environment variables. We could put a "includes all the env var" rule in rules_python, but if you just want a subset, then you have to define the repository rule yourself (which rules_python could provide helpers to do to make simpler). edit: use_repo_rule() can simply the above; no need to create the extension |
Ah, yes, you're right. But the
https://bazel.build/rules/lib/globals/bzl#repository_rule I think switching from
https://bazel.build/rules/lib/builtins/repository_ctx.html#getenv (Tough for things like |
Maybe I spoke too soon. The |
Yeah, I agree. We can skirt the "correctness" for something like PIP_TIMEOUT until getenv is more widely available. PIP_RETRIES falls into this category too, right? getenv sounds much nicer, but yeah, we'll need to support Bazel 6 until Dec 2025 and Bazel until Dec 2026. So that's a ways off. If you're using the rolling bazel releases, then I'd be fine with conditional usage of getenv. @aignas WDYT: punt on this envvar-change-detection issue for now? We can always come back and set the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general though I agree with the discussion above about using the available APIs for simplicity and maintainability.
In the case where you vendor the requirements.bzl file, the environment variables at the time the requirements.bzl file was generated were being used. It is much more likely that the desired behavior is to use the environment variables at the time the dependencies are fetched, apart from the fact that some environment variables may contain sensitive info. There's already an envsubst in _whl_library_impl, so we just need to propagate the envsubst list in the _config we write to the vendored requirements.bzl file.
Move the envsubst parts of the CHANGELOG.md to the unreleased section.
At the time of writing, the very latest Bazel, as in `USE_BAZEL_VERSION=last_green bazelisk`, supports `rctx.getenv(name, default)`: When building incrementally, any change to the value of the variable named by name will cause this repository to be re-fetched. That hasn't yet made its way into the official releases, though.
Since
MODULE.bazel
files don't supportload()
, it is more difficult to grab the value of an environment variable such asPIP_INDEX_URL
orPIP_RETRIES
and feed it topip.parse()
. Some of thesePIP_*
variables are useful inextra_pip_args
. To enable this use case, also for the bzlmod scenario, add theenvsubst
attribute, which is a list of environment variables to substitute inextra_pip_args
. If$VARNAME
,${VARNAME}
or${VARNAME:-default}
is found inextra_pip_args
, substitute the value of the variable.