Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for envsubst in extra_pip_args #1673

Merged
merged 10 commits into from
Jan 31, 2024

Conversation

vonschultz
Copy link
Contributor

Since MODULE.bazel files don't support load(), it is more difficult to grab the value of an environment variable such as PIP_INDEX_URL or PIP_RETRIES and feed it to pip.parse(). Some of these PIP_* variables are useful in extra_pip_args. To enable this use case, also for the bzlmod scenario, add the envsubst attribute, which is a list of environment variables to substitute in extra_pip_args. If $VARNAME, ${VARNAME} or ${VARNAME:-default} is found in extra_pip_args, substitute the value of the variable.

Since MODULE.bazel files don't support load(), it is more difficult to
grab the value of an environment variable such as PIP_INDEX_URL or
PIP_RETRIES and feed it to pip.parse(). Some of these PIP_* variables
are useful in extra_pip_args. To enable this use case, also for the
bzlmod scenario, add the "envsubst" attribute, which is a list of
environment variables to substitute in extra_pip_args. If $VARNAME,
${VARNAME} or ${VARNAME:-default} is found in extra_pip_args,
substitute the value of the variable.
Comment on lines 368 to 369
A list of environment variables to substitute (e.g. `["PIP_INDEX_URL",
"PIP_RETRIES"]`). The corresponding variables are expanded in `extra_pip_args`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vonschultz, could you elaborate a little on what is the usecase here? Is it only for the requirements files that your project owns? What should we do for the PIP_INDEX_URL in a non-root module? On one hand I think that the root module should be able to override the index url + extra index urls for everyone, but if that is what we want to achieve with this PR, then maybe having support for this in:

pip.override(
    index_url = "foo",
    extra_index_urls = ["bar", "baz"],
)

might be maybe cleaner?

We could append the index_url and prepend the extra_index_urls to the extra pip args and those would take precedence (I think).

Are there other env var names that you want to expand here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task I need it for is converting a project from WORKSPACE.bazel to MODULE.bazel in a case where the WORKSPACE.bazel uses environment variables (by writing them to a file using a repo rule and loading that file). Since loading isn't supported in MODULE.bazel I need the support for environment variables here instead. I have only been thinking in terms of the root module, so I'm not entirely sure what the desired behavior would be in a non-root module.

In principle, there are as many potential use cases as there are flags to pip, but to focus on the ones explicitly mentioned in the patch:

  • PIP_RETRIES, and also PIP_TIMEOUT for that matter, are useful when the user has an unstable Internet connection. Rather like Bazel's own --http_timeout_scaling, which a user might put in their ~/.bazelrc, it is a property of the environment we run in, rather than the repo itself.
  • PIP_INDEX_URL is given to us by our CI system, and is pointing to an AWS CodeArtifact URL. The PIP_INDEX_URL contains a short-lived token for authentication, so we need to read it from the environment rather than hardcoding it in the MODULE.bazel file.

I haven't been thinking of pip.override. Isn't that specific for a given Python package (distribution)?
https://rules-python.readthedocs.io/en/latest/api/extensions/pip.html#pip-override
It doesn't sound like it's something you'd use to override the index url.

I'm not convinced overriding the index url for everyone (including non-root modules) is the desired behavior. If someone uses the default https://pypi.org/simple certainly, we'd like to be able to redirect that, but if a non-root module very explicitly and purposefully uses a different index url containing a small set of packages not found on the public PyPI, then it's not a given that we should override that.

(Incidentally, I think extra_index_urls is a huge security issue, and should never be used. That's how we got hit by https://pytorch.org/blog/compromised-nightly-dependency/, and I'm sure there are many other cases like it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the environment variables I mentioned, I can also imagine PIP_PROXY being useful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for elaborating on the extra index url usage.

I am wondering if the right thing here would be to have an attr for each thing pip supports - the downside would be that we would need to have stuff in sync with the pip version we support/use, but the upside would be that it would be more structured and we could more easily ensure that sensitive info does not leak into the lock file.

What do you think? How many extra attributes would you need? @rickeylev, do you have any thoughts on this?

python/private/bzlmod/pip.bzl Outdated Show resolved Hide resolved
Comment on lines 368 to 369
A list of environment variables to substitute (e.g. `["PIP_INDEX_URL",
"PIP_RETRIES"]`). The corresponding variables are expanded in `extra_pip_args`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for elaborating on the extra index url usage.

I am wondering if the right thing here would be to have an attr for each thing pip supports - the downside would be that we would need to have stuff in sync with the pip version we support/use, but the upside would be that it would be more structured and we could more easily ensure that sensitive info does not leak into the lock file.

What do you think? How many extra attributes would you need? @rickeylev, do you have any thoughts on this?

Instead of implementing envsubst support directly in the pip bzlmod
extension, move it into the affected rules (whl_library and
pip_repository). This way, the functionality will also benefit people
who don't use bzlmod yet, and it may reduce the risk of leaking
sensitive info into the MODULE.bazel.lock file.
@aignas
Copy link
Collaborator

aignas commented Jan 18, 2024

Retried the build and it was fine. There are feature requests to use the bazel downloader to download wheels/sdists from PyPI instead of using PyPI and I think envsubst would be required to be used separately from the whl_library as well. Would you be willing to create a separate file (maybe in python/private/envsubst.bzl) and add a few unit tests in //tests/private/envsubst/<mychosenname>? This and a CHANGELOG entry and this PR should be ready to be merged IMHO.

Move envsubst to its own file to facilitate testing, and add some unit
tests in tests/private/envsubst. Since tests revealed that nesting
silently produces unexpected results, add a check to limit nesting
of environment variables in the default of a different environment
variable expansion.
We've added envsubst to pip_parse and pip_repository.
In //python/pip_install:pip_repository_bzl, add a dependency to a new
//python/private:envsubst_bzl library, which has the envsubst.bzl
file.
@vonschultz
Copy link
Contributor Author

I got a new error on Windows now, which at a glance seems unrelated to my patches:

The process cannot access the file because it is being used by another process

Looks like I don't have the required access level to retry the job, however, so I'm afraid I'll have to ask you to take a look at it, if you don't mind: https://buildkite.com/bazel/rules-python-python/builds/6937#018d1cf7-6942-4148-af4c-2747940a0be6

@rickeylev
Copy link
Contributor

rickeylev commented Jan 18, 2024

Yeah, that's a flake. I pressed retry.

Few thoughts.

re: extra index url being a security issue: I don't think it's any worse than a custom repository rule? An arbitrary module might implement a custom repo rule to download arbitrary things or execute arbitrary things and include it into a program that gets run at build time. extra index url isn't any worse. It's still scoped to their pip.parse/targets.

re: environment variables: Should we add some PIP_* names to the repo rule's/module extensions environment variable settings? Or perhaps pip.parse(consider_env=True) flag, and under the hood it calls a variation of the pip repo rule that has repository_rule.environ set to the various pip environment variables.

Right now, any changes to the environment won't be reflected in subsequent runs -- you'll have to manually sync/clean (and thus invalidate everything) to pick up any envvar changes.

it is more difficult to grab the value of an environment variable such as PIP_INDEX_URL or PIP_RETRIES and feed it to pip.parse()

I think the thing you have to do is create a repo rule to read the env var and pass it as a file into pip.parse. Something like:

# MODULE.bazel

use_extension("//:pipenviron.bzl", "pipenviron")
pip.parse(environ_file = "@pipenviron//:environ.json")

# pipenviron.bzl

def module_impl(mctx):
  pip_env(name="pipenviron")

pip_env = repository_rule(..., environ=["PIP_RETRIES", ..])

def pip_env_impl(rctx):
  rctx.file("environ.json", json.encode({"PIP_RETRIES": rctx.os.environ(...)}))

Unfortunately, a repository rule can't have a dynamic set of environment variables. We could put a "includes all the env var" rule in rules_python, but if you just want a subset, then you have to define the repository rule yourself (which rules_python could provide helpers to do to make simpler).

edit: use_repo_rule() can simply the above; no need to create the extension

@vonschultz
Copy link
Contributor Author

Right now, any changes to the environment won't be reflected in subsequent runs -- you'll have to manually sync/clean (and thus invalidate everything) to pick up any envvar changes.

Ah, yes, you're right. But the environ attribute of repository_rule is deprecated:

Deprecated. This parameter has been deprecated. Migrate to repository_ctx.getenv instead.
Provides a list of environment variable that this repository rule depends on. If an environment variable in that list change, the repository will be refetched.

https://bazel.build/rules/lib/globals/bzl#repository_rule

I think switching from rctx.os.environ to rctx.getenv should be sufficient to make sure changes to the environment will be picked up.

When building incrementally, any change to the value of the variable named by name will cause this repository to be re-fetched.

https://bazel.build/rules/lib/builtins/repository_ctx.html#getenv

(Tough for things like PIP_TIMEOUT, do we really want a change of the variable to cause already-fetched things to be re-fetched? As long as it takes effect for things we haven't already fetched, that's sufficient for that use case.)

@vonschultz
Copy link
Contributor Author

Maybe I spoke too soon. The repository_ctx.getenv() function was introduced just 16 hours ago, and is not yet in any official Bazel release. I could write code that tries getenv if available and falls back to rctx.os.environ if it's not there, but then we'd not have environment changes reflected on Bazel releases older than today. Or we could go with an environment file, and that would work the same way with all Bazel versions, even though we could have done this in a nicer way on newer Bazel versions.

@rickeylev
Copy link
Contributor

Tough for things like PIP_TIMEOUT, do we really want a change of the variable to cause already-fetched things to be re-fetched? As long as it takes effect for things we haven't already fetched, that's sufficient for that use case.

Yeah, I agree. We can skirt the "correctness" for something like PIP_TIMEOUT until getenv is more widely available. PIP_RETRIES falls into this category too, right?

getenv sounds much nicer, but yeah, we'll need to support Bazel 6 until Dec 2025 and Bazel until Dec 2026. So that's a ways off. If you're using the rolling bazel releases, then I'd be fine with conditional usage of getenv.

@aignas WDYT: punt on this envvar-change-detection issue for now? We can always come back and set the environ list or add an environ_file arg.

Copy link
Collaborator

@aignas aignas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general though I agree with the discussion above about using the available APIs for simplicity and maintainability.

CHANGELOG.md Outdated Show resolved Hide resolved
python/pip_install/pip_repository.bzl Outdated Show resolved Hide resolved
In the case where you vendor the requirements.bzl file, the
environment variables at the time the requirements.bzl file was
generated were being used. It is much more likely that the desired
behavior is to use the environment variables at the time the
dependencies are fetched, apart from the fact that some environment
variables may contain sensitive info.

There's already an envsubst in _whl_library_impl, so we just need to
propagate the envsubst list in the _config we write to the vendored
requirements.bzl file.
Move the envsubst parts of the CHANGELOG.md to the unreleased section.
At the time of writing, the very latest Bazel, as in
`USE_BAZEL_VERSION=last_green bazelisk`, supports `rctx.getenv(name,
default)`: When building incrementally, any change to the value of the
variable named by name will cause this repository to be re-fetched.
That hasn't yet made its way into the official releases, though.
@aignas aignas added this pull request to the merge queue Jan 31, 2024
Merged via the queue into bazelbuild:main with commit 8fa6212 Jan 31, 2024
4 checks passed
@vonschultz vonschultz deleted the envsubst branch January 31, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants