
internal: repos to create a toolchain from a locally installed Python #2000

Merged (8 commits) Jul 16, 2024

Conversation

rickeylev
Contributor

@rickeylev rickeylev commented Jun 21, 2024

This adds the primitives for defining a toolchain based on a locally installed Python.
Doing this consists of two parts:

  • A repo rule to define a Python runtime pointing to a local Python installation.
  • A repo rule to define toolchains for those runtimes.

The runtime repos create platform runtimes, i.e., they set py_runtime.interpreter_path.
This means the runtime isn't included in the runfiles.
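A minimal sketch of what such a generated runtime repo's BUILD file could contain (the target name and load path here are illustrative, not the rule's actual output):

```starlark
load("@rules_python//python:py_runtime.bzl", "py_runtime")

py_runtime(
    name = "py3_runtime",
    # A platform runtime: point at the system interpreter by absolute
    # path instead of bundling it, so nothing lands in runfiles.
    interpreter_path = "/usr/bin/python3",
    python_version = "PY3",
)
```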

Note that these repo rules are largely implementation details and are definitely not
stable API-wise. Creating public APIs to use them through WORKSPACE or bzlmod will
be done in a separate change (there are a few design and behavior questions to discuss).

This is definitely experimental quality. In particular, the code that tries
to figure out the C headers/libraries is very finicky. I couldn't find solid docs about
how to do this, and there's a lot of undocumented settings, so what's there is what
I was able to piece together from my laptop's behavior.

Misc other changes:

  • Also fixes a bug when a pyenv-backed interpreter path is used for precompiling:
    pyenv uses $0 to determine what to re-exec, and the :current_interpreter_executable
    target used its own name, which pyenv didn't understand.
  • The repo logger now also accepts a string. This helps prevent an accidentally
    passed string from causing an error. It's also a bit more convenient when
    doing development.
  • Repo loggers will automatically include their rule name and repo name. This
    makes following logging output easier.
  • Makes repo_utils.execute() report progress.
  • Adds repo_utils.getenv, repo_utils.watch, and repo_utils.watch_tree:
    backwards compatibility functions for their rctx equivalents.
  • Adds repo_utils.which_unchecked: calls which, but allows for failure.
  • Adds repo_utils.get_platforms_os_name(): Returns the name used in @platforms for
    the OS reported by rctx.
  • Makes several repo util functions call watch() or getenv(), if available. This
    makes repository rules better respect environmental changes.
  • Adds more detail to the definition of an in-build vs. platform runtime.
  • Adds a README for the integration tests directory. Setting up and using one is a bit
    more involved than other tests, so some docs help.
  • Allows integration tests to specify bazel versions to use.
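For context, repo_utils.get_platforms_os_name() presumably maps the OS name reported by rctx.os to the constraint names used in @platforms. A standalone Python sketch of that kind of mapping (the actual rule's behavior may differ in details):

```python
def get_platforms_os_name(rctx_os_name):
    """Map an OS name as reported by rctx.os.name to an @platforms//os name."""
    name = rctx_os_name.lower()
    if name.startswith("mac os") or name.startswith("darwin"):
        return "osx"
    if name.startswith("windows"):
        return "windows"
    if name.startswith("linux"):
        return "linux"
    return ""  # Unknown OS; callers must handle the empty string.

print(get_platforms_os_name("Linux"))       # → linux
print(get_platforms_os_name("Windows 10"))  # → windows
```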

@rickeylev
Contributor Author

Alright, I mentioned this in chat, but I'll post a more full description of some of the problems a local auto configuring toolchain encounters.

Ideally, we want something like this:

# The user inputs something like this:
local_runtime(
  name = "linux_runtime",
  path = "/usr/bin/python3",  # Turns out to be Python 3.10
)
local_runtime(
  name = "windows_runtime",
  path = "C:\\Program Files\\whatever\\python3.exe",  # Turns out to be Python 3.11
)

# And it generates something like this:

toolchain(
  name = "linux_toolchain",
  toolchain = "@linux_runtime//:runtime",
  target_compatible_with = [linux],
  target_settings = [is_python_3.10],
)
toolchain(
  name = "windows_toolchain",
  toolchain = "@windows_runtime//:runtime",
  target_compatible_with = [windows],
  target_settings = [is_python_3.11],
)

Unfortunately, I don't see a way to do that without triggering the evaluation of the @linux_runtime and @windows_runtime repos. This is because of the constraint settings:

In order to know the target_compatible_with (TCW) values, we have to... I don't know. I think manually associate them with the local_runtime() call? But that also seems wrong -- /usr/bin/python3 is valid for both Linux and Mac. So I guess transform rctx.os into a constraint string (but we want to ignore rctx.os on Linux if it's a Windows path)? Maybe the host constraint is the other option?

For the target settings, we have to run Python in order to determine the version. But a Linux host can't run the Windows path and vice versa. So now... I don't know. Omit target_settings? Consider the whole toolchain incompatible?

But let's say it's the happy path -- we can run the given path, get the version, figure out the OS, etc. This is all computed within the runtime repo rule. If a different repo rule generates the toolchain() calls, that info needs to be passed in -- which means that, in order to access that info, it has to trigger evaluation of the runtime repo. This is true even if we do something like toolchain(target_compatible_with = ["@linux_runtime//:os_alias"]) (where os_alias is an alias to the appropriate @platforms target) -- even though it's a target reference, toolchain evaluation will have to resolve that target, and thus trigger evaluation of the repo.

The only paths I see to deal with all this are:

  1. Just accept that evaluation will occur. It should be relatively cheap? If a runtime
     can't be set up (e.g. Linux trying to run a Windows path), silently fail (create
     something like an empty no-op runtime structure).
  2. Require (or optionally allow; requiring seems to defeat the point of an
     auto-configuring thing) passing in platform and Python version info. If we want to
     get advanced, we can write the platform and version into a separate repo when we
     know them at repo-evaluation time (this allows resolving the constraints while
     avoiding evaluating the runtime repo itself).
  3. Make the local_toolchain(...) API take platform-specific args, e.g.
     local_runtime(windows_path = "C:\...", linux_path = "...", ...); though I'm not
     sure how to encode the Python version. It seems like a full and complete set of
     args would be a list of (str path, list platform_constraints, list target_settings)
     tuples or something.
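For option 2, a hypothetical sketch (every attribute name here is invented for illustration) of what passing the info in explicitly might look like:

```starlark
# Hypothetical API sketch -- none of these attribute names are final.
local_runtime(
    name = "linux_runtime",
    path = "/usr/bin/python3",
    # Supplied explicitly so the toolchain() calls can be generated
    # without evaluating the runtime repo itself:
    python_version = "3.10",
    target_compatible_with = ["@platforms//os:linux"],
)
```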

@aignas
Collaborator

aignas commented Jun 23, 2024

I like approach 1 out of the ones you listed. It should be cheap to just have:

def _linux_impl(rctx):
    if not _is_linux(rctx):
        rctx.file("BUILD.bazel", """\
toolchain(
    name = "linux_toolchain",
    toolchain = "@linux_runtime//:runtime",
    target_compatible_with = ["@platforms//:incompatible"],
    target_settings = [],  # maybe we have to add an incompatible target setting here.
)
""")
        return

    ...

Because the repository rule is not doing any network IO and the failure is
really fast, I think we should be good here.

Talking about python versions, do you want this toolchain to be used when
is_python_3.x is true? What about the default case when the python_version
is unset? If we have hermetic toolchains also in the mix, which one is
preferred?

I think that if we could affect the default value of
//python/config_settings:python_version based on the result of this linux
toolchain or the python extension evaluation, it would be super nice, because
that could simplify a lot of code that we have that handles the case where the
python_version string flag is unset, but we have to default to something.

@rickeylev
Contributor Author

do you want this toolchain to be used when is_python_3.x is true?

Yes. We end up having to run python no matter what, so we have the version. Integrating it with the version-aware parts just requires adding the target_settings value. So it should be easy.
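The version query itself is cheap; a standalone Python sketch of the idea (querying sys.executable here, where the repo rule would use the user-configured interpreter path instead):

```python
import json
import subprocess
import sys

# Ask the interpreter at a given path for its version by running a
# tiny inline program and parsing the JSON it prints.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys, json; print(json.dumps(list(sys.version_info[:3])))"],
    capture_output=True, text=True, check=True,
)
major, minor, micro = json.loads(result.stdout)
print(f"{major}.{minor}")  # e.g. "3.11", used to pick the is_python_3.11 setting
```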

If we have hermetic toolchains also in the mix, which one is preferred?

By default, the local ones, because they're opt in. It's technically up to the user, though, since they ultimately can futz with the toolchain registration order.

@rickeylev rickeylev force-pushed the local.py.toolchains branch 3 times, most recently from b42834c to 8b115f8 Compare July 12, 2024 22:29
* make logging accept string
* add which_unchecked
* add text_util.str

make watch/watch_tree calls options for earlier bazel support

only basic bzlmod support is implemented, and 6.4 lacks some necessary
features to test it
@rickeylev rickeylev marked this pull request as ready for review July 13, 2024 01:50
@rickeylev
Contributor Author

OK, ready for review.

I've been sitting on this a while, so I pulled the scope back to just the repository rules that can set up the runtime and toolchains. It works, but you can see that there's a decent amount of configuration needed in MODULE.bazel. Integrating it into the python bzlmod extension can clean that up, but it got surprisingly messy. I'll post in chat about some of the design/behavior things we need to decide on. Using it in WORKSPACE is probably similar, but I haven't tried it with workspace builds yet.

@rickeylev rickeylev changed the title feat(toolchains): create a toolchain from a locally installed Python internal: repos to create a toolchain from a locally installed Python Jul 13, 2024
Collaborator

@aignas aignas left a comment


I really like how clean your repository_rule code is. Really nice structure and I like how self-contained the code is. Thank you!

Comment on lines +121 to +132
native.config_setting(
    name = "_is_major_minor",
    flag_values = {
        _PYTHON_VERSION_FLAG: major_minor,
    },
)
native.config_setting(
    name = "_is_major_minor_micro",
    flag_values = {
        _PYTHON_VERSION_FLAG: major_minor_micro,
    },
)
Collaborator


As discussed over a call, it could be nice to add another flag to allow the user to disable the toolchain with a line in the .bazelrc:

build --@rules_python//python/config_settings:local_toolchain=disable

This could be done in a followup PR to keep this more scoped.

Contributor Author


Yeah, this is an interesting idea. I think I like it? Agree out of scope for this PR.

Hm. Thinking out loud...

It would make for an easy way to use local instead of the hermetic runtimes. This sounds especially appealing for early adoption and testing -- users don't have to put a bunch of experimental config API in their MODULE files.

The basic thing we could do is:

  • Have bzlmod generate two repos per version (one local, one hermetic)
  • Use select() on the toolchain attribute, switching based on the flag.

And I think this would work for version-aware toolchains, too. The bzlmod config is given the version, so there's no need to load the local_runtime repo to determine it.

And then we can punt on the problems of mix-and-matching local and hermetic runtimes for now.
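A sketch of that select() idea, with hypothetical flag, repo, and target names:

```starlark
# Hypothetical: all labels below are invented for illustration.
toolchain(
    name = "python_3_11_toolchain",
    toolchain = select({
        "@rules_python//python/config_settings:local_toolchain_enabled": "@local_python_3_11//:runtime",
        "//conditions:default": "@hermetic_python_3_11//:runtime",
    }),
    target_settings = ["@rules_python//python/config_settings:is_python_3.11"],
)
```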

Comment on lines 376 to 390
repo_utils = struct(
    # keep sorted
    debug_print = _debug_print,
    execute_checked = _execute_checked,
    execute_checked_stdout = _execute_checked_stdout,
    execute_unchecked = _execute_unchecked,
    get_platforms_os_name = _get_platforms_os_name,
    getenv = _getenv,
    is_repo_debug_enabled = _is_repo_debug_enabled,
    logger = _logger,
    watch = _watch,
    watch_tree = _watch_tree,
    which_checked = _which_checked,
    which_unchecked = _which_unchecked,
)
Collaborator


It feels like this is getting big, but it's good to leave it as is for now. All of these are low-level repository_ctx related things, so it's fine.

Contributor Author


Yeah, I've grown to really like our little repo_utils library here. The logger, error reporting, compat functions -- it's actually pretty nice. Part of me wants to rewrite it in an object-oriented fashion so we can do e.g. repo = repo_utils.create(rctx); repo.which_checked(...); repo.log(...); repo.watch(...); etc just to have a cleaner API.
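Roughly, that object-oriented shape could look like this (purely hypothetical, not existing code):

```starlark
def create(rctx):
    """Bind rctx into a struct of helpers so call sites don't pass it around."""

    def which_checked(binary_name):
        return repo_utils.which_checked(rctx, binary_name)

    def watch(path):
        return repo_utils.watch(rctx, path)

    return struct(
        which_checked = which_checked,
        watch = watch,
    )
```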

Collaborator


I like how Go has the ctx and passes it around everywhere. It is kind of a sign that IO is being done here or there. That makes pure Starlark functions really obvious.

So I like the current design a lot. :)

@@ -0,0 +1,5 @@
common --action_env=RULES_PYTHON_BZLMOD_DEBUG=1
Collaborator


Oh, using this in tests is really useful. Should we have this in CI for the examples jobs as well?

Contributor Author


If it helps debug, sure. It can be a bit spammy, though. We should document it somewhere, though I'm not sure where. It'd be nice if we had a place we can put our debugging tips we've all picked up along the way.

Hm, maybe we should create repo_utils.fail() and have it always include a message like "set --action_env=blabla for debugging information".

Collaborator


I think having a Troubleshooting section might be good. We could point to Troubleshooting sub-sections within Toolchains, PyPI integration, etc.

@rickeylev rickeylev enabled auto-merge July 16, 2024 23:04
@rickeylev rickeylev added this pull request to the merge queue Jul 16, 2024
Merged via the queue into bazelbuild:main with commit 68c3048 Jul 16, 2024
4 checks passed
@aignas
Collaborator

aignas commented Jul 17, 2024

BTW, I realized that the current hermetic toolchain understands whether it is Linux glibc. Is it possible to detect if the interpreter is built for musl vs glibc?

@rickeylev rickeylev deleted the local.py.toolchains branch July 17, 2024 06:32
@rickeylev
Contributor Author

After a bit of searching, yes, I think so. The info looks to be available via the platform module: https://stackoverflow.com/questions/72272168/how-to-determine-which-libc-implementation-the-host-system-uses

@aignas
Collaborator

aignas commented Jul 19, 2024

Regarding glibc: if you look at the hermetic toolchain, we specify the flag value to be glibc (I haven't gotten to adding musl toolchains yet), but what it essentially does is include the right interpreter when building docker images, etc.

I know that this is meant for local toolchains, so it's low priority IMHO, but it seems that having the flag_values in the target_settings include the glibc constraint might mean that we can correctly handle the precedence of toolchains when targeting the musl libc.

@rickeylev
Contributor Author

OK, I was a bit wrong about how easy detecting musl is.

The platform module has a libc_ver() function that can return the libc implementation, but it doesn't know how to indicate that something is musl -- it'll just return an empty string. Maybe that will have to suffice.
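To illustrate the libc_ver() behavior (safe to run anywhere; the output depends on the host's libc):

```python
import platform

# libc_ver() inspects the running executable. On glibc systems it
# returns something like ("glibc", "2.35"); it has no musl detection,
# so on musl it falls back to ("", "").
lib, version = platform.libc_ver()
print(lib or "<empty -- possibly musl, possibly just unknown>")
```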

The other option is to use packaging, but that isn't part of the stdlib, and the way it works takes a decent amount of code: a small ELF format parser to find the path to something (the dynamic linker, I think), which is then run to get some output. There's a PR for Python itself, python/cpython#103784; however, it seems to have stalled out.

Since it seems like the answer is to parse something out of the executable's ELF information, I tried various system utils to do the same, but couldn't figure out how.

Looking through sysconfig, there do appear to be several mentions of musl in some variables, e.g. as part of command line arguments. So that could be a weak signal.
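A quick way to look for that weak signal, assuming nothing about which variable names mention musl:

```python
import sysconfig

# Scan all build-time config vars for the substring "musl", e.g. in
# HOST_GNU_TYPE or compiler command lines on musl-based systems.
musl_vars = {
    key: value
    for key, value in sysconfig.get_config_vars().items()
    if isinstance(value, str) and "musl" in value.lower()
}
print(sorted(musl_vars) or "no musl mentions found")
```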

@aignas
Collaborator

aignas commented Jul 20, 2024 via email
