Add support for building C extensions #824

Open
2 of 5 tasks
blais opened this issue Sep 6, 2022 · 35 comments

Comments

@blais

blais commented Sep 6, 2022

🚀 feature request

  • Add support for getting Python headers (Python.h) -- use @rules_python//python/cc:current_py_cc_headers
  • Provide shared libraries (libpython.so et al) -- use @rules_python//python/cc:current_py_cc_libs
  • Provide static libraries (libpython.a)
  • Add a rule/rules to build C extensions
  • Add PyInfo.cc_info field to allow C/C++ info (CcInfo) to propagate (directly
    propagating CcInfo isn't done, to prevent a Python target from being
    accidentally accepted by a plain CC target)

Describe alternatives you've considered

I've cobbled together support for these things by hand in the past, and was hoping building extension modules would be officially supported at some point.

Some projects, such as tensorflow and pybind11_bazel, have a python_configure.bzl file and/or py_extension rule that attempts to implement this.

(Note: this issue was originally about updating the readme to address the lack of this feature; it's been edited to be a feature request for this feature directly - rickeylev)

@rickeylev
Contributor

Can you clarify in the README how this module relates to building extension modules?

Which readme are you referring to, this repo's readme? This repo, pybind11, and tensorflow have different owners. There's some owner-adjacency between this repo and pybind11 (Ralf and I are on the same team within Google, but we don't really overlap in work much), but they're different projects. I think grpc also has a custom solution that mostly-works-except-when-it-doesn't.

Pybind11 is a higher-level way to write C extensions.

Tensorflow's purpose is to be an ML library -- if it provides a py_extension, then it's almost certainly just incidental. I've talked to various TF folks and they, too, want an official py_extension rule they can use.

I'm pretty certain all of these solutions are incomplete and/or unsound in some way -- I don't think the toolchain provides enough information to implement them entirely correctly.

hoping building extension modules would be officially supported at some point.

Yes, that's the plan. There aren't really any details beyond that, though. While there is a py_extension rule within Google, it's not compatible with Bazel as-is and has some historical baggage that makes it unsavory to release as-is.

I'm happy to discuss design and implementation ideas about it, though.

@blais
Author

blais commented Sep 6, 2022

Can you clarify in the README how this module relates to building extension modules?

Which readme are you referring to, this repo's readme? This repo, pybind11, and tensorflow have different owners. There's some owner-adjacency between this repo and pybind11 (Ralf and I are on the same team within Google, but we don't really overlap in work much), but they're different projects. [...]

Pybind11 is a higher-level way to write C extensions.

Tensorflow's purpose is to be an ML library -- if it provides a py_extension, then it's almost certainly just incidental. I've talked to various TF folks and they, too, want an official py_extension rule they can use.

Hi Richard,
I mean this particular repo. From an external user's perspective, the support for building Python extension modules isn't clear. This is the repo for Python rules for Bazel. Isn't this the most appropriate place to tell Bazel users about, and to host, support for building extension modules for Python? I read rules_python ~= "Bazel + Python".

Tensorflow is only involved because they had to build those rules for themselves before others. As such, their files are used as examples people copy rules from. Pybind11 is indeed another project but closely related ("Bazel + Python + C++"). It probably deserves to be coordinated with this one? I spoke with Ralf a while ago (years ago) and my understanding is that the priority was elsewhere.

I think grpc also has a custom solution that mostly-works-except-when-it-doesn't.

Exactly. (Do you believe this is a great state of affairs?)
It seems to me this is something to gather in one place and fix once and for all for all these core projects.
In an ideal world, there would be a single copy of both these files, at least in the Google-owned projects.

Right now I don't even know which one to model mine from.

I'm pretty certain all of these solutions are incomplete and/or unsound in some way -- I don't think the toolchain provides enough information to implement them entirely correctly.

Here's what's happening now: AFAICT looking at OSS projects, people - including Googlers for Google open source projects - are copying two types of artifacts:

  • configuration to build against a local runtime, usually called python_configure.bzl
  • rules for building extension modules, usually called py_extension.bzl

Mutating and evolving copies of these two files are proliferating. At a minimum, it would be wise to place a generic version of each in here that serves either as a directly usable version or as an authoritative source for these definitions - one that perhaps, if it is not adequately parameterizable, still needs to be copied and modified by some projects. At least having a single, latest, best, central version of these two files would allow one to track them and merge their changes into local copies. Not ideal, but that would already be an improvement over the current situation.

hoping building extension modules would be officially supported at some point.

Yes, that's the plan. There aren't really any details beyond that, though. While there is a py_extension rule within Google, it's not compatible with Bazel as-is and has some historical baggage that make it unsavory to release as-is.

I'm happy to discuss design and implementation ideas about it, though.

I'm outside Google now, and I don't remember what the particular baggage involved was. It would be useful to know what is needed to make these rules usable outside of google3, what the Bazel folks are thinking, and what the current status of supporting those is. For example, is the reason they're not supported out of the box due to a particular technical issue? If so, can it be listed / documented somewhere? Or is it that their definition always involves a particular custom specification?

(This problem is emblematic of the issue with Google open source. Why isn't the company making an effort to make most of its open source projects easily importable together using Bazel? IMHO a worthy goal would be for the core OSS projects to be made usable with a one-liner in a WORKSPACE file, without having to clone files and spend hours figuring out incompatibilities and patches. That's a decent project for promo IMO, requiring coordination across teams all over the org, a deep understanding of how Bazel works, and some understanding of all the core projects. Right now it's a bit of a mess; even Google's own OSS projects duplicate these files, and that makes it difficult to cobble a set of them together easily.)

@rickeylev
Contributor

rickeylev commented Sep 24, 2022

To build a DSO extension, we basically need 2 + 1 pieces.

First, we need a toolchain to give us the Python headers (really, the CcInfo for them). We have that in rules_python, sort of. The hermetic toolchain has a cc_library with the headers, but I don't think you can actually get to it. I don't think a system toolchain has anything here. A catch here is I think this has to be a separate toolchain (or maybe just toolchain type) for bootstrapping reasons. (see below).

Second, we need APIs to invoke the cc tools. We have that, sort of. Well, maybe fully now. The key piece is being able to tell the cc APIs that the output library name needs to be e.g. "foo.so" instead of "libfoo.so"[1]. I think this can be worked around by creating a symlink with the desired name (there's a Java-only cc API for this; not sure if it's available to Starlark yet). But, also, there's the issue of what goes into the DSO. The full transitive dependency closure? Just the srcs? Some subset of the transitive closure? This is where cc_shared_library comes in; it basically lets you control all that. I'm not sure if it's considered generally available yet; it was only mentioned to me a couple months ago internally (I think they've been working closely with Tensorflow).

The +1 part is py_binary needs some way to consume the DSO. Just sticking it in the data of a py_library to get it into the runfiles mostly works; this is basically what people do today and it works OK. The shortcoming I see is that downstream targets, like binaries, don't know the DSO is there. It's just another data file in the runfiles. As a concrete example of what this metadata is good for: internally, we use that metadata to help inform how to turn the py_binary into a C executable that embeds Python.

The bootstrapping issue I'm not 100% sure about. The particular example I have in mind is coverage support. This was recently added to the Python toolchain, so to use it your Python toolchain has to depend on it. Coverage has its own C extension (ctrace, iirc). If you want to build coverage from source, then how do you get the Python headers? You have to ask the toolchain. But you're in the middle of constructing that toolchain, so how can you ask that toolchain for information? (Again, not 100% sure here, but this is how I understand things work.) I realize most of you are fans of e.g. pre-built wheels (they definitely have their advantages), but supporting building from source is also a major win.

Internally, the baggage our py_extension rule carries is that it's very tightly coupled with cc_binary; as in, at the Java level, it's a subclass of it and calls various internal APIs. It has some other quirks, but that's the main thing. This is a case where starting fresh with the modern CC Starlark APIs is probably going to work out better. A lot more thought has gone into the CC rules and how to do this since our rule came about. I think this might be entirely doable in Starlark today.

Hope this answers part of your question!

re: DSO infix, pybind11, and wheel building from #1113: If the extension-building rule also carried information about its ABI and python-tag needs, then this can help inform what extension a DSO can or should use.

[1] It should have the CPython infixes, too, but you get the gist.
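
To make the pieces above concrete, here's a minimal sketch of the kind of thing a future rule would wrap (illustrative only: it uses the current_py_cc_headers forwarding target that was added later, and cc_shared_library may still require its experimental flag):

cc_library(
    name = "foo_impl",
    srcs = ["foo.c"],
    # Python.h comes from the toolchain via this forwarding target.
    deps = ["@rules_python//python/cc:current_py_cc_headers"],
)

cc_shared_library(
    name = "foo_shared",
    # Name the DSO "foo.so" rather than the default "libfoo_shared.so".
    shared_lib_name = "foo.so",
    deps = [":foo_impl"],
)

py_library(
    name = "foo",
    # For now the DSO is just dropped into runfiles as data; downstream
    # targets get no metadata saying it's an extension module.
    data = [":foo_shared"],
)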

@blais
Author

blais commented Sep 24, 2022

Thanks Richard, this is a great insight into the depth of the issues involved in making this work.
I need to chase down definitions for some of the bits you allude to, probably will come back with questions.
Thank you,

@rickeylev
Contributor

@rwgk @jblespiau @michaelreneer @laramiel
(various people at Google I know who are interested in this problem as well)

@laramiel

We also have rules similar to the above in tensorstore:

https://github.com/google/tensorstore/blob/master/third_party/python/python_configure.bzl

@rickeylev
Contributor

Oh, another part I should call out is static vs dynamic linking. My understanding of C++ linking is fairly shallow, but in any case: IMHO, this detail should be transparent to users. e.g., as a Python C extension library author, I should be able to change from one to the other (or provide both and let consumers choose) without anyone's Python code caring. This sort of functionality requires metadata propagation so that the py_binary level can put things together appropriately.

@rickeylev rickeylev changed the title Clarify how this module includes (or not) support for Python C headers and extension modules. Add support for building C extensions Nov 2, 2022
@ethanluoyc

ethanluoyc commented Jun 20, 2023

@rickeylev I recently experimented with using bazel for compiling C++ extensions. I thought I could share some confusion I had when attempting to use bazel from a library development perspective.

Bazel has been used for building C extensions for quite a number of ML-related projects at Google (also DeepMind); see envlogger, reverb, launchpad, and also tensorstore (from @laramiel above) for examples.

I think at least for me personally, besides the lack of support for bundling extensions in py_wheel, it wasn't clear to me what the workflow would look like for creating Python distributions with extensions using bazel: should we expect end users to (1) use bazel build and run the py_wheel rules, or (2) maybe adopt bazel as a build backend as per PEP 517?

(1) would be closest to the existing implementation of py_wheel, but it would probably not be very convenient for end users, as they now have to learn a different workflow for installing dependencies built with bazel from source. I personally think (2) would ideally be the way to go, but it would probably take more effort to get there.

Thinking about (2), I realize there are at least a few things that aren't clear to me at the moment.

(2) suggests that a build frontend will create an isolated environment with the necessary build-time dependencies. I suppose in that case bazel should respect the installed interpreter dependencies and use those for building? Using installed dependencies is also convenient, for example, for libraries that depend on TensorFlow's libtensorflowframework.so, so that you can avoid building TF from source.
Also, how should this interact with hermetic Python in theory? For library development, it's also quite likely you would want to build a separate version of the distribution for each ABI. In that case, it seems that we need the capability to configure different toolchains based on run-time configuration if we want to support PEP 517; is this something that is possible to do now in Bazel? In addition, I wasn't sure how this should interact with the pip_parse rules.

Right now, the packages I mentioned above seem to adopt the python_configure approach. In that case, most of the libraries require the users to pre-install the dependencies. The entire process is quite brittle overall, with all sorts of quirks and hacks.

Personally, I think I would like a hermetic environment for development, but still want an easy way to test the built distribution on a range of Python versions.
Therefore, switching between a local_config_python and a hermetic Python should be easy. From the documentation and examples, I couldn't find an example of how to do this, but maybe that is just because of my lack of understanding of how to configure it to work in this case. I saw there is a transition.bzl sitting around at https://github.com/bazelbuild/rules_python/blob/main/python/config_settings/transition.bzl; is this something similar to what I was thinking about above?

@rickeylev
Contributor

(2) is basically a superset of (1). In order for something like PEP 517 to use bazel as a backend, it has to configure the Python toolchains to use whatever is appropriate. In the end, it has to do e.g. bazel build :foo, where :foo is a py_wheel target (or some equivalent).

Note that:

  • Using bazel to build doesn't require that dependencies are built from source.
  • Using rules_python doesn't require using hermetic Python.

Also, how should this interact with hermetic Python in theory?

It interacts through toolchains. Toolchains are the abstraction rules use to let someone else configure some set of inputs to the rule, e.g. the location of Python header files.

For library development, it's also quite likely you would want to build a separate version of the distribution for each ABI. In that case, it seems that we need to capability to configure different toolchains based on run-time configuration if we want to support PEP517, is this something that is possible to do now in Bazel? In addition, I wasn't sure how this should interact with the pip_parse rules.

Essentially, yes. This is handled by transitions and constraints. As an example, you can define a rule that performs a split transition, which allows building the same target for multiple different configurations (e.g. different abis). Using constraints, targets and toolchains can control what their inputs are depending on mostly arbitrary criteria; this is how the multi-version python rules work today (the transition.bzl file you mentioned).
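
A rough sketch of the split-transition half, assuming the @rules_python//python/config_settings:python_version build setting is the knob being flipped (the names and exact setting label here are illustrative, not an official API):

# multi_version.bzl (illustrative)
_PY_VERSION_FLAG = "@rules_python//python/config_settings:python_version"

def _python_versions_impl(settings, attr):
    # A split transition: one configuration per requested Python version.
    return {v: {_PY_VERSION_FLAG: v} for v in attr.python_versions}

_python_versions = transition(
    implementation = _python_versions_impl,
    inputs = [],
    outputs = [_PY_VERSION_FLAG],
)

def _multi_version_impl(ctx):
    # ctx.split_attr.dep maps each version key to that version's build of `dep`.
    return [DefaultInfo(files = depset(transitive = [
        dep[DefaultInfo].files
        for dep in ctx.split_attr.dep.values()
    ]))]

multi_version = rule(
    implementation = _multi_version_impl,
    attrs = {
        "dep": attr.label(cfg = _python_versions),
        "python_versions": attr.string_list(),
        "_allowlist_function_transition": attr.label(
            default = "@bazel_tools//tools/allowlists/function_transition_allowlist",
        ),
    },
)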

Right now, the packages I mentioned above seems to adopt the python_configure approach. In that case, most of the libraries would require the users to pre-install the dependencies. The entire process is quite brittle overall with all sorts of quirks and hacks.

Yes, correct on all counts. The main missing piece here is a toolchain to carry the C dependency information (e.g. headers, the :python_headers target in the toolchain repos generated by the hermetic rules).

Note that all this toolchain stuff is not specific to the rules_python hermetic python stuff. You can, for example, implement a repository that tries to locate Python on the local system and define a toolchain using that information. That's basically what those local_config_python things try to do. Downstream rules are largely unaware if a local or hermetic toolchain is used.

is [transition.bzl] something similar to what I was thinking about above?

It's in an adjacent area, yes. That code is an example of how to make multiple Python versions co-exist in one build. You could, for example, define a test_all_the_pythons() macro, which, under the hood, is basically a loop doing py_test(name = "foo_" + V, python_version=V), and then a single bazel test ... invocation would run all N tests concurrently under the different Python versions.
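
For illustration, a sketch of such a macro (the names and version list are made up; the python_version attribute is the mechanism described above):

# test_all_pythons.bzl (illustrative)
load("@rules_python//python:defs.bzl", "py_test")

def test_all_the_pythons(name, srcs, deps = [], versions = ["3.9", "3.10", "3.11"]):
    # One py_test per interpreter version; a single `bazel test ...` runs them all.
    for version in versions:
        py_test(
            name = "{}_{}".format(name, version.replace(".", "_")),
            srcs = srcs,
            main = srcs[0],
            deps = deps,
            python_version = version,
        )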

@ethanluoyc

ethanluoyc commented Jun 20, 2023

@rickeylev Thanks for the detailed reply!

Yes, correct on all counts. The main missing piece here is a toolchain to carry the C dependency information (e.g. headers, the :python_headers target in the toolchain repos generated by the hermetic rules).

One thing I wasn't sure how to do is to configure a pybind_extension from https://github.com/pybind/pybind11_bazel/tree/master/py to simultaneously work with multiple toolchains. For the local_config_python approach, the workflow mostly asks the user to call local_config_python in WORKSPACE, and a specific interpreter is bound there. Initially, I thought that it would be possible to do some sort of dynamic configuration using a toolchain that was registered, but it does not seem as straightforward as I would have hoped (initially I thought that if you have registered multiple toolchains, there may be an alias in the toolchain repo that can look up the correct Python header and pass it to the cc_binary, but I wasn't able to find anything like this). From what you described, ideally the correct thing to do here is to first allow the Python toolchain to carry the header information and rewrite pybind11_extension as a rule?

Another thing I wasn't sure about is the right way to handle requirements when bazel is configured to produce wheels. In the case of making bazel a PEP 517 backend, I would imagine the dependencies being installed into the build environment via the frontend, but the pip_parse rules will resolve the dependencies using the lock file (which may be different from the version retrieved during installation).

On an unrelated note, I encountered all of this as I was working on a PR here which uses bazel to compile native extensions. I managed to get things to work, but the entire process right now is pretty brittle and many things feel very unnatural.

@rickeylev
Contributor

Well, I think I got the toolchain half working. A cc_library can just depend on @rules_python//python/cc:current_py_cc_headers to get the correct headers (see question at the end though). I need to add a couple more tests, and will be unavailable for a bit, but this is otherwise looking very promising!

Fiddling a bit more, I was able to use cc_shared_library to build an importable Python C extension. I didn't vet it thoroughly, but a cursory examination of the .so output looks plausibly correct (e.g. libc is marked needed; though libstdc++ is also marked needed, which I thought py extensions, being plain C, wouldn't need? There's probably a compiler flag missing).

There's a fair amount of boilerplate to create the DSO, so we'll definitely want a rule to hide some of that. Also note that cc_shared_library is still experimental. I'm not sure how (or if) a regular cc_library can create an equivalent.

But, a few questions others better versed in C might know:

  • There's a :libpython target with DSOs. Should these also be made available? I'm assuming so?
  • There's a static library that currently isn't exposed. Should this also be made available?
  • The above two mean there are possibly 3 different "things" that need to be exposed -- should they be made available as separate things? e.g., if you want the headers, you depend on :current_py_cc_headers; if you want the DSOs, you depend on :current_py_cc_dynamic_libs; if you want the static libs, you depend on :current_py_cc_static_libs. (These all come through the same toolchain; it's just that feeding that information into e.g. a cc_library requires an intermediate target to do the lookup and forward along the information -- see the sketch after this list.)
  • There's also an "interface" library (a cc_import) that is Windows-only. I have no idea what this is; it looks internal, but the entire toolchain BUILD file is marked public, so I don't really know for sure.

@junyer

junyer commented Jun 22, 2023

(e.g. libc is marked needed; though libstdc++ is also marked needed, which I thought py extensions, being plain C, wouldn't need? There's probably a compiler flag missing).

Anything using pybind11 will be written in C++. ;)

  • There's also an "interface" library (a cc_import) that is windows only. I have no idea what this is; it looks internal, but the entire toolchain BUILD file is marked public, so I don't really know for sure.

FWIW, this bit of Windows documentation says:

An import library (.lib) file contains information the linker needs to resolve external references to exported DLL functions, so the system can locate the specified DLL and exported DLL functions at run time. You can create an import library for your DLL when you build your DLL.

I don't know whether it's possible (and reasonable) to "hide" that detail at the Bazel level such that users depend on a single target regardless of the platform.

@rickeylev
Contributor

Anything using pybind11 will be written in C++. ;)

Yeah, this was a hand-rolled Python C API extension, though. Basically a no-op extension. See foo.cc and its build file for how I was building it.

windows docs and hiding interface lib detail

Thanks for the doc link; it explains a little bit, I think. The interface lib is already hidden, I think. It's a dependency of the target that holds the DSO files.

rickeylev added a commit to rickeylev/rules_python that referenced this issue Jun 28, 2023
rickeylev added a commit to rickeylev/rules_python that referenced this issue Jun 29, 2023
rickeylev added a commit to rickeylev/rules_python that referenced this issue Jun 29, 2023
This allows getting a build's `cc_library` of Python headers through
toolchain resolution instead of having to use the underlying toolchain's
repository `:python_headers` target directly. This makes it easier and
more reliable to build C extensions with the correct Python runtime's
headers (e.g. the same ones for the runtime a binary uses).

Work towards bazelbuild#824
rickeylev added a commit to rickeylev/rules_python that referenced this issue Jun 29, 2023
This allows getting a build's `cc_library` of Python headers through
toolchain resolution instead of having to use the underlying toolchain's
repository `:python_headers` target directly.

Without this feature, it's not possible to reliably and correctly get
the C information about the runtime a build is going to use. Existing
solutions require carefully setting up repo names, external state,
and/or using specific build rules. In comparison, with this feature,
consumers are able to simply ask for the current headers via a special
target or manually lookup the toolchain and pull the relevant
information; toolchain resolution handles finding the correct headers.

The basic way this works is by registering a second toolchain to carry
C/C++ related information; as such, it is named `py_cc_toolchain`. The
py cc toolchain has the same constraint settings as the regular py
toolchain; an expected invariant is that there is a 1:1 correspondence
between the two. This base functionality allows a consuming rule
implementation to use toolchain resolution to find the Python C
toolchain information.

Usually what downstream consumers need are the headers to feed into
another `cc_library` (or equivalent) target, so, rather than have every
project reimplement the same "lookup and forward cc_library info" logic,
this is provided by the `//python/cc:current_py_cc_headers` target.
Targets that need the headers can then depend on that target as if it
was a `cc_library` target.

Work towards bazelbuild#824
@rickeylev
Contributor

I have draft #1287 and am working through CI tests. If there are particular features or behaviors that need to be exposed, now is a good time to let me know. Right now only the headers are made available. I'm not sure what to do with the other libs.

The basic way it works is defining another toolchain to carry and forward the C information. Right now, it only works with the hermetic runtimes. There's no particular reason it couldn't work with a system install, but someone else will have to contribute a repo rule to do that (the previously mentioned local_config_python stuff is basically that -- happy to review a PR adding that to rules_python).

tag people who may not be following the convo closely: @rwgk @jblespiau @michaelreneer @laramiel

@ethanluoyc

When I try to use the new py_cc_toolchain to configure a non-hermetic Python, I get some weird errors.

If I have something like

licenses(["restricted"])
load("@rules_python//python/cc:py_cc_toolchain.bzl", "py_cc_toolchain")

cc_library(
    name = "python_headers",
    hdrs = [],
)

py_cc_toolchain(
    name = "local_py_cc_toolchain",
    headers = ":python_headers",
    python_version = "3.10",
)

and on the command line I use something like bazel build --extra_toolchains=//:local_py_cc_toolchain //:main, it raises an error saying

Misconfigured toolchains: //tools:local_py_cc_toolchain is declared as a toolchain but has inappropriate dependencies. Declared toolchains should be created with the 'toolchain' rule and should not have dependencies that themselves require toolchains.
.-> toolchain type @bazel_tools//tools/cpp:toolchain_type
|   RegisteredToolchains
|   //tools:local_py_cc_toolchain
|   //tools:python_headers
|   toolchain types @bazel_tools//tools/cpp:toolchain_type
`-- toolchain type @bazel_tools//tools/cpp:toolchain_type

I am mostly just mimicking the Python repo BUILD file configured by rules_python, which works fine. I am not entirely sure what is going on...

@rickeylev
Contributor

I noticed that it does not register the py_cc_toolchains by default though, is that intentional?

Nope, not intentional. Thanks for reporting. Are you using a WORKSPACE setup, and you have to call e.g. native.register_toolchains() somewhere? I may have overlooked updating the WORKSPACE logic, come to think of it.

weird errors

I think you're just missing the toolchain() call to glue "this is a toolchain" to the actual target that implements a toolchain. e.g. you need this:

bazel build --extra_toolchains=//:my_toolchain

toolchain(
  name = "my_toolchain",
  ...
  toolchain=":my_toolchain_impl",
)
py_cc_toolchain(
  name = "my_toolchain_impl",
  ...
)
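
Putting that together, the BUILD file ends up looking something like this (assuming @rules_python//python/cc:toolchain_type is the relevant toolchain type; double-check the label against the rules_python docs):

load("@rules_python//python/cc:py_cc_toolchain.bzl", "py_cc_toolchain")

cc_library(
    name = "python_headers",
    hdrs = glob(["include/**/*.h"]),  # wherever the local Python.h lives
)

py_cc_toolchain(
    name = "local_py_cc_toolchain_impl",
    headers = ":python_headers",
    python_version = "3.10",
)

# The toolchain() wrapper is what --extra_toolchains / register_toolchains expects.
toolchain(
    name = "local_py_cc_toolchain",
    toolchain = ":local_py_cc_toolchain_impl",
    toolchain_type = "@rules_python//python/cc:toolchain_type",
)

and --extra_toolchains then points at //:local_py_cc_toolchain (the toolchain() target), not at the py_cc_toolchain target itself.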

@junyer

junyer commented Aug 8, 2023

Although I can't speak authoritatively/officially for pybind11_bazel, as I said on pybind/pybind11_bazel#55, it would be nice to use @rules_python eventually. @rules_python//python/cc:current_py_cc_headers exists and @ethanluoyc demonstrated modifying PYBIND_DEPS to use it, which is great! However, as well as extending Python, pybind11_bazel supports embedding Python, which is currently implemented by @local_config_python's python_embed target, which has copts and linkopts that are obtained through python_configure.bzl's interrogation of the Python interpreter itself. Thus, I think the "ask" here is for an equivalent library target or, failing that, for a blessed/correct way of interrogating the Python interpreter itself. (Does the latter already exist? Could we use @rules_python//python:current_py_toolchain and $(PYTHON3)? 🤔)

@rickeylev
Contributor

A data point from pyO3 (a rust-Python interop thing): https://github.com/PyO3/pyo3/blob/2d3dc654285eb8c9b29bb65b1d93323f551a8314/pyo3-build-config/src/impl_.rs#L223

(this came via a slack thread)

Basically, it runs a small python program to find a variety of pieces of information:

  • platform.python_implementation
  • from sysconfig.get_config_var: LDVERSION, LIBDIR, EXT_SUFFIX, Py_ENABLED_SHARED
  • pointer size (I think mypyc also wants this)
  • sys.base_prefix
  • OS
    ...among some others.

Running a sub-process to get that is non-ideal; it'd be better done through toolchain resolution, or most ideally, something available during the analysis phase.

supports embedding Python

Yes, this case I want to support, too. A tricky part of the support is a Python interpreter isn't necessarily required to make it work -- having an interpreter you can run is just the means to the end of getting the location of e.g. headers etc.

python_embed target

Just making sure -- is libpython.so part of what it's trying to get? That's what it looks like. I saw that DSO sitting around, but wasn't sure what to do with it. It should be easy to add onto the py_cc_toolchain. It sounds like that would be helpful? I think this would replace the cc_library(name=python_lib) target you linked to.

which has copts and linkopts

What does this part look like? Internally, there's a variety of copts and linkopts we pass around to things, but what we do is unique in a few ways, so I'm not sure which parts are specific to our setup or just general things one has to set when building python stuff.

I think the "ask" here is for an equivalent library target

This info would live on either the PyCcToolchain info, or some CcInfo it exposes (ala how it currently exposes headers). I think something that would be illustrative to me is knowing what else around python/repositories.bzl#_python_repository_impl should be doing to expose the necessary information.

@junyer

junyer commented Nov 30, 2023

PYTHON_EMBED_COPTS and PYTHON_EMBED_LINKOPTS come from _get_embed_flags(), which interrogates python-config for --cflags and --ldflags (--embed), respectively, and generates the following for python_embed on my workstation:

    linkopts = ["-L/usr/lib/python3.11/config-3.11-x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -lpython3.11 -ldl  -lm "],
    copts = ["-I/usr/include/python3.11 -I/usr/include/python3.11  -Wsign-compare -g   -fstack-protector-strong -Wformat -Werror=format-security  -DNDEBUG -g -fwrapv -O2 -Wall"],

In this case, it does appear to want to get libpython3.11, yes, but I'm not sure how important all the other flags are for, say, ABI compatibility. libpython3.11.a and libpython3.11.so both exist, FWIW, but I don't know how much that matters here?

@junyer

junyer commented Nov 30, 2023

Quoting pybind/pybind11_bazel#65 (comment):

What you really want is to get this information from the toolchain directly. e.g., in the regular rule code, […]

I have no idea whether this makes things easier, but pybind11_bazel doesn't have any Starlark rules – only Starlark macros. :)

@rickeylev
Contributor

pybind11_bazel doesn't have any Starlark rules – only Starlark macros

I don't think it changes the situation too much. The two have different capabilities. A combination of macros and rules is quite plausible. I'll post a more detailed response on pybind/pybind11_bazel#64

get embed flags

Thanks. This is interesting and new 🙃 . So from what I can figure out, there's a pythonX.Y-config tool and it basically prints out the compile/link flags you probably want to use when embedding. Reading the embedding docs, it sounds like these settings are only intended for when embedding.

Since we're talking about embedding, what this means is a cc_binary wants the information about the Python runtime -- its headers, copts, linkopts, and shared libraries?

It's mostly feasible to, as part of setting up the downloaded runtime, run python-config and parse its output into something to stick into the generated build files. The two problems I see are:

  1. Some of the values python-config returns seem...out of its purview? For example, it returns -Wall -O3, which seems outside of what it should be deciding?
  2. This won't work for cross-platform building. e.g. you're on linux and are going to build for Mac. That means the Mac python-config has to run, but it can't on a linux machine.

@rickeylev
Contributor

This won't work for cross-platform building.

Oh hm, I had an idea of how this might be doable: I seem to recall that @<path> can be used to specify a file with additional command line args? So have python-config run as an action during a regular build; it puts its output (the flags) into a file, and then the path to that file is added to the flags that are put into CcInfo. I think a separate toolchain is needed to decouple the target-config and exec-config restrictions (this allows an arbitrary combination of target and exec configs), but that's not a big deal.

@junyer

junyer commented Dec 1, 2023

Some of the values python-config returns seem...out of its purview? For example, it returns -Wall -O3, which seems outside of what it should be deciding?

Digging into this has been quite "enlightening". There are two versions of python-config: a Python script; and a shell script. With the Python script, --cflags is --includes plus the CFLAGS configuration variable as per the sysconfig module; see here. With the shell script, --cflags is --includes plus $BASECFLAGS, $CFLAGS and $OPT, whose values were baked in at install time; see here. There are also two pkg-config .pc files: one for extending; and one for embedding. Using those could actually be preferable for our purposes, I suspect, but I should note that they differ subtly; see here and here, respectively. It would arguably be useful to study how, say, https://github.com/cherrry/bazel_pkg_config uses pkg-config.

I seem to recall that @<path> can be used to specify a file with additional command line args?

Indeed, Bazel seems to prefer doing so when linking due to the length of the command line. For example, _re2.so-2.params on my workstation is 217 lines.

@rickeylev
Contributor

There are also two pkg-config .pc files

Nice! Those look very promising. They look easy enough to parse so that e.g. a linux host could parse them to generate the settings for mac. Do those give all necessary info? They seem much shorter than the shell script.

they differ subtly;

Yeah, I noticed that too. Something that isn't clear to me is what re-using a cc_library for both cases looks like, e.g.:

cc_binary(name="embed", deps=[":foo", ":python_runtime_lib"])
py_binary(name="extend", deps=[":foo"])
cc_library(name="foo", ...) # makes "import foo" work
cc_library(name="python_runtime_lib", ...)

Does some setting of the :foo target itself need to change based on whether it's being embedded or not (e.g. linkopts or copts)? e.g. is copts = select(:embedded: X, :nonembedded: Y) needed? If not, then I think we can push that info into the toolchain and it's easy. If so, then something has to perform a transition to set that.

I think the python_runtime_lib would only be useful for embedding? So maybe naming it python-embedded_runtime_lib makes more sense. I can't imagine another context it would be used for. Maybe if one was building Python from source and not embedding?

@junyer

junyer commented Dec 5, 2023

Nice! Those look very promising. They look easy enough to parse so that e.g. a linux host could parse them to generate the settings for mac. Do those give all necessary info? They seem much shorter than the shell script.

I don't know for sure, TBH, but it would seem like a bug to report upstream (i.e. to the Python project itself) if embedding has missing flags; it seems pretty clear – from the .pc file and also from actual experience – that extending needs headers only. OTOH, I have no idea about embedding on Windows, so interrogating the sysconfig module for configuration variables may or may not turn out to be needed anyway. (BTW, pybind11_bazel doesn't currently claim to support embedding on Windows.)

Does some setting of the :foo target itself need to change based on whether it's being embedded or not (e.g. linkopts or copts)?

I don't think so...

I think the python_runtime_lib would only be useful for embedding?

That's my understanding.

@rickeylev
Contributor

pkg-config pc files

It looks like standalone python might help a bit: PYTHON.json

https://gregoryszorc.com/docs/python-build-standalone/main/distributions.html#python-json-file

It has...a lot of info. I think I saw link flags in there? And since it's JSON, we can just directly load it in Starlark without having to write a special parser.

@junyer

junyer commented Dec 21, 2023

Whoa, awesome! I don't see any flags except for inittab_cflags, but I suppose python_config_vars must have them. And we could always just ask @indygreg? :)

@cloudhan

py_cc_toolchain needs to allow users to use a non-hermetic host Python. This is to allow pybind11_bazel to move to rules_python first...

@mering

mering commented Jan 2, 2024

FYI gRPC has Bazel Cython support: https://github.com/grpc/grpc/blob/master/bazel/cython_library.bzl

rickeylev added a commit to rickeylev/rules_python that referenced this issue Jan 22, 2024
This exposes the runtime's C libraries through the py cc toolchain.
This allows tools to embed the Python runtime or otherwise link
against it.

Work towards bazelbuild#824
github-merge-queue bot pushed a commit that referenced this issue Jan 24, 2024
This exposes the runtime's C libraries through the py cc toolchain. This
allows tools to embed the Python runtime or otherwise link against it.

It follows the same pattern as with the headers: the toolchain consumes
the cc_library,
exports CcInfo, and a `:current_py_cc_libs` target makes it easily
accessible to users.

Work towards #824

* Also upgrades to rules_testing 0.5.0 to make use of rules_testing's
DefaultInfoSubject
rickeylev added a commit to rickeylev/rules_python that referenced this issue Feb 12, 2024
This is just a basic test to ensure the
`@rules_python//python/cc:current_py_cc_libs` target is making the
Python C libraries available; that the compiler and linker are both
happy enough to resolve symbols.

The test program itself isn't actually runnable, as that requires
more configuration than I can figure out in telling Python where its
embedded runtime information is.

Work towards bazelbuild#824
@rickeylev
Contributor

one update, one response:

There's a pybind11_bazel PR to move onto the current_py_cc_{headers,libs} targets. It looks pretty promising. See pybind/pybind11_bazel#72

I put together a small test in #1749 to verify that current_py_cc_libs is working. It's not very comprehensive, but it at least shows it's doing something right (linker and compiler are happy). A real end-to-end test for embedding would be great. I'm very happy to review PRs fleshing out our testing in this area.

re: @cloudhan

py_cc_toolchain needs to allow users to use a non-hermetic host Python. This is to allow pybind11_bazel to move to rules_python first...

This is possible today, but it requires setting up your own toolchain. This basically translates to writing a repo rule to symlink/copy things on the local machine into a repo to fool various bazel things elsewhere. I have a basic impl to make using a local runtime easy at https://github.com/rickeylev/rules_python/tree/local.py.toolchains, and it appears to work; however (1) it's hung up on integrating it into bzlmod (just tedious refactoring work) and (2) correctly finding the various library paths is surprisingly non-trivial (some comments in my branch link to an SO post that talks about it; I think this will just require some iterations and bug reports from different systems).
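
For anyone who wants to experiment before that branch lands, the general shape of such a repo rule is roughly this (a minimal sketch assuming python3 is on PATH and only handling headers; it ignores the library-path subtleties mentioned above and is not the branch's actual code):

# local_python_repo.bzl (illustrative sketch)
def _local_python_repo_impl(rctx):
    python = rctx.which("python3")
    if not python:
        fail("python3 not found on PATH")

    # Ask the local interpreter where its headers live.
    result = rctx.execute([
        python,
        "-c",
        "import sysconfig; print(sysconfig.get_paths()['include'])",
    ])
    include_dir = result.stdout.strip()
    rctx.symlink(include_dir, "include")

    rctx.file("BUILD.bazel", """
cc_library(
    name = "python_headers",
    hdrs = glob(["include/**/*.h"]),
    includes = ["include"],
    visibility = ["//visibility:public"],
)
""")

local_python_repo = repository_rule(implementation = _local_python_repo_impl)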

github-merge-queue bot pushed a commit that referenced this issue Feb 12, 2024
…1749)

This is just a basic test to ensure the
`@rules_python//python/cc:current_py_cc_libs` target is making the
Python C libraries available; that the compiler and linker are both
happy enough to resolve symbols.

The test program itself isn't actually runnable, as that requires more
configuration than I can figure out in telling Python where its embedded
runtime information is.

For now, the test is restricted to Linux. The Mac and Windows CC
building isn't happy with it, but I'm out of my depth about how to make
those happy.

Work towards #824
@fzakaria

fzakaria commented Jun 25, 2024

I just went through the process of building pynacl and cffi using rules_python.
I'll dump what I did here in case it helps someone else.

First here is cffi.BUILD

# cffi needs to import a _cffi_backend.so so let's make it
# it's not 100% clear when to use cc_shared_library vs cc_binary with linkshared = True
cc_shared_library(
    name = "_cffi_backend",
    shared_lib_name = "_cffi_backend.so",
    deps = [
        ":cffi_backend",
    ],
    # Although rules_python provides @rules_python//python/cc:current_py_cc_libs
    # you can't use it in dynamic_deps since it's not a cc_shared_library,
    # so let's just disable validating symbols with -undefined dynamic_lookup on MacOS.
    # You'll want the equivalent for Linux.
    user_link_flags = select({
        "@platforms//os:macos": ["-undefined", "dynamic_lookup", "-F/System/Library/Frameworks"],
        "//conditions:default": [],
    }),
)

cc_library(
    name = "cffi_backend",
    srcs = [
        "c/_cffi_backend.c",
    ],
    hdrs = glob(["c/*.h"]) + [
        "_cffi_errors.h",
        "parse_c_type.h"
    ],
    textual_hdrs = glob([
        "c/*.c"
    ], exclude = ["c/_cffi_backend.c"]),
    deps = [
        "@rules_python//python/cc:current_py_cc_headers",
        "//third_party/c/libffi"
    ],
)

py_library(
    name = "cffi",
    srcs = glob(["*.py"]),
    data = [
        "//third_party/python/cffi:_cffi_backend",
    ],
    imports = [
        # Needed to import the _cffi_backend.so from cffi
        "../../../third_party/python/cffi",
    ],
)

Finally we can create pynacl.BUILD

py_library(
    name = "nacl",
    srcs = glob(["*.py", "bindings/*.py"]) + [":_sodium"],
    data = [
        "//third_party/c/libsodium:libsodium_shared",
        "//third_party/python/cffi:_cffi_backend"
    ],
    imports = [
        "../../../third_party/python",
        # Needed to import the _cffi_backend.so from cffi
        "../../../third_party/python/cffi",
    ],
)


py_binary(
    name = "build",
    srcs = ["c_bindings/build.py"],
    deps = [
        "//third_party/python/cffi",
        "//third_party/python/pycparser"
    ],
    data = glob(["c_bindings/**/*.h"]),
    imports = [
        "../../../third_party/python",
    ],
)

# pynacl runs the build.py to create the FFI bindings.
# you can run cffi in various modes. Here we are running it 'out of band', generating a python script
# rather than the C file.
# An improvement would be to generate the C file and compile that itself into the shared object
genrule(
    name = "_sodium",
    outs = ["_sodium.py"],
    cmd = "./$(location build); mv _sodium.py $(@); echo 'lib = ffi.dlopen(\"third_party/c/libsodium/libsodium.so\")' >> $(@)",
    tools = [
        ":build"
    ]
)

py_test(
    name = "test_nacl",
    srcs = ["test_nacl.py"],
    deps = [
        ":nacl",
        "//third_party/python/pytest",
    ],
)

@rickeylev
Contributor

Nice!

it's not 100% clear when to use cc_shared_library vs cc_binary with linkshared = True

cc_binary will always link all transitive dependencies, while cc_shared_library gives you some control over what is linked in. You almost certainly want to use cc_shared_library (part of its origin is tensorflow wanting to build python C extensions).

Although rules_python provides @rules_python//python/cc:current_py_cc_libs
you can't use it in dynamic_deps since it's not a cc_shared_library,
so let's just disable validating symbols with -undefined dynamic_lookup on MacOS.
You'll want the equivalent for Linux.

I think this is WAI and you have it right. IIUC, mac linkers, by default, want to verify things at link time, while linux ones don't, hence the extra flag to disable that. I found this blog post that explains a bit. Based on that post, you also don't want current_py_cc_libs, because you're not embedding python (i.e. the whole interpreter). Instead, the Python symbols will be found at runtime as part of loading the C extension into the Python process.
