Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use custom build_ext to compile Cython code #21600

Closed
jdemeyer opened this issue Sep 26, 2016 · 52 comments
Closed

Use custom build_ext to compile Cython code #21600

jdemeyer opened this issue Sep 26, 2016 · 52 comments

Comments

@jdemeyer
Copy link

In src/setup.py, we should not run cythonize() manually. It would be better to use a custom build_ext command for that.

This is inspired by the build_ext from Cython 0.25: https://github.com/cython/cython/blob/master/Cython/Distutils/build_ext.py
Unfortunately, we cannot really use that since it doesn't allow to pass options to cythonize().

Due to a Cython bug, this warning gets displayed even though the cache option is valid: UserWarning: got unknown compilation option, please remove: cache.

Depends on #21604

CC: @mkoeppe @embray

Component: build

Author: Jeroen Demeyer

Branch/Commit: dfcd817

Reviewer: Matthias Koeppe, Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/21600

@jdemeyer jdemeyer added this to the sage-7.4 milestone Sep 26, 2016
@jdemeyer

This comment has been minimized.

@jdemeyer jdemeyer changed the title Use Cython's build_ext Use custom build_ext to compile Cython code Sep 27, 2016
@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link
Author

Branch: u/jdemeyer/ticket/21600

@jdemeyer
Copy link
Author

New commits:

5a55862Fix build of pyzmq with Cython 0.25
85634c2Upgrade Cython to version 0.25
5474247Move old_style_globals to modules; other directives to cythonize() call
6d6d349Run cythonize() inside build_ext command

@jdemeyer
Copy link
Author

Commit: 6d6d349

@jdemeyer
Copy link
Author

Changed dependencies from #20596 to #21604

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Sep 28, 2016

Changed commit from 6d6d349 to 35ecf4b

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Sep 28, 2016

Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:

3a8cc0eFix typo in comment
0dd9c50Respect environment variable MAKE
17f90d8beautification
e5f9065More comments
7791cd9Remove --buildbase code
74169e7Pass SAGE_SRC to generate_py_source.mk
0394333Add new file to MANIFEST.in
fdedb02Install SAGE_LOCAL/bin/sage instead of SAGE_ROOT/sage as Jupyter kernel
5ba95edClean up stale installed files in install command
35ecf4bRun cythonize() inside build_ext command

@embray
Copy link
Contributor

embray commented Sep 29, 2016

comment:8

If I'm to understand correctly, this is mostly just taking the existing Cython build code and moving into the build_ext command, right? Or was anything substantive changed in the build code too (beyond what was already from #21604)?

@jdemeyer
Copy link
Author

comment:9

Replying to @embray:

If I'm to understand correctly, this is mostly just taking the existing Cython build code and moving into the build_ext command, right?

Exactly. But it interacts in non-trivial ways with other parts of setup.py (in particular, finding the data files), so this isn't ready yet.

@embray
Copy link
Contributor

embray commented Sep 30, 2016

comment:10

Could you be more specific? I can probably help.

@jdemeyer
Copy link
Author

comment:11

Well, I haven't figured out precisely what to do with this part from src/setup.py:

from sage_setup.find import find_python_sources, find_extra_files
python_packages, python_modules = find_python_sources(
    SAGE_SRC, ['sage', 'sage_setup'])
python_data_files = find_extra_files(python_packages,
    ".", SAGE_CYTHONIZED, SAGE_LIB, ["ntlwrap.cpp"])

This code determines some inputs to the setup() call, so logically it should come before calling setup().

On the other hand, the last line assumes that Cython has run because it looks for files in SAGE_CYTHONIZED. So, if we run Cython in the build_ext command of setup() (what this ticket already implements), then we can no longer determine the python_data_files before calling setup().

@embray
Copy link
Contributor

embray commented Sep 30, 2016

comment:12

One secret about distutils, which is both a bug and a feature--and the fact that it's often exploited is why it's so hard to change anything about distutils--is that in many cases it doesn't really matter what you pass to setup(). It's mostly important for static metadata, but most other details about building the package can be tweaked in various places along the line (albeit sometimes requiring care).

So for example, the data_files passed to setup() is stored on the Distribution object in the .data_files attribute. So this can be accessed from any command via self.distribution.data_files. If a command, like build_py or build_ext is also responsible for generating files, some of which might result in installable resource files (for Cython modules this is mainly header files), it can also append those generated files, as appropriate, to self.distribution.data_files for later use.

Where this requires care is it's sometimes important to make sure you're using those attributes in ways that are compatible with how other commands use them, for which the best documentation is the source code (another misfeature). data_files is an easy one because it's only used by a couple other commands, mainly install/install_data, and sdist. This is easy because the build command is normally run automatically before install, so as long as the build_ext command updates the data_files appropriately they will be handled correctly.

sdist is a little trickier--it does not normally run build first, though if you invoke setup.py with multiple commands it will work correctly (like ./setup.py build sdist). I normally subclass sdist to have it automatically run build first if I have Cython modules, because I also want to include the C sources in my source dist. But I'm not too worried about that right now since ./setup.py sdist doesn't work right for Sage for other reasons too. That will be a separate issue.

@jdemeyer
Copy link
Author

comment:13

Thanks a lot! I think I have enough information now to work on this.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 4, 2016

Changed commit from 35ecf4b to 2b5fa98

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 4, 2016

Branch pushed to git repo; I updated commit sha1. New commits:

2b5fa98Copy extra files to build directory instead of using data_files

@embray
Copy link
Contributor

embray commented Oct 4, 2016

comment:15

I'm not quite sure why you're not not using data_files at all, though for files installed with the Python modules themselves it's more customary to use package_data but that's neither here nor there. I suspect the problem might have to do in part with oddity in how find_extra_files is expected to work.

I'm not sure why the dist.extra_files = either. There's no reason now to go assigning arbitrary new attributes to the Distribution instance.

@jdemeyer
Copy link
Author

jdemeyer commented Oct 4, 2016

comment:16

Replying to @embray:

I'm not quite sure why you're not not using data_files at all

I could have done that, but the usage of data_files in Sage has always been a bit fishy, see #20108. So given that I'm changing the data file handling anyway, I decided to stop using data_files.

though for files installed with the Python modules themselves it's more customary to use package_data

That doesn't work either: package_data assumes that the package data files are in the same directory as the Python sources. This is not the case for the Cython-generated files, so I cannot use package_data. But I am processing the data files similarly as what distutils does with package_data: copying them to the build directory. Since the data files appear in the build directory, they are automatically installed in site-packages.

I suspect the problem might have to do in part with oddity in how find_extra_files is expected to work.

No, we can always change find_extra_files to do whatever we want it to do.

I'm not sure why the dist.extra_files = either. There's no reason now to go assigning arbitrary new attributes to the Distribution instance.

These extra_files are also needed for cleaning in the sage_install class. So I need to pass them somehow and assigning an attribute to the Distribution instance seemed like the most natural way.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 4, 2016

Branch pushed to git repo; I updated commit sha1. New commits:

7e3a492Rename extra_files -> sage_extra_files

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 4, 2016

Changed commit from 2b5fa98 to 7e3a492

@embray
Copy link
Contributor

embray commented Oct 4, 2016

comment:19

Replying to @jdemeyer:

Replying to @embray:

though for files installed with the Python modules themselves it's more customary to use package_data

That doesn't work either: package_data assumes that the package data files are in the same directory as the Python sources. This is not the case for the Cython-generated files, so I cannot use package_data. But I am processing the data files similarly as what distutils does with package_data: copying them to the build directory. Since the data files appear in the build directory, they are automatically installed in site-packages.

Since the purpose of package_data is specifically to list data files that should be installed in the Python packages, I think it makes sense to use anyways. Normally this wouldn't be a problem even for Cython generated files, and the only reason it is is because we're handling those in a "weird" (to distutils) and unexpected way. That's fine--I mean, I don't agree with it but it's not abhorrent or anything--we just need to make some tweaks to how package_data is handled so that it searches additional file hierarchies.

But for coherency I would rather use the existing option that is specifically for this purpose, and use our command subclasses to enhance its functionality to work for our purposes, rather than make up something entirely new.

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link
Author

jdemeyer commented Oct 4, 2016

comment:21

Replying to @embray:

we just need to make some tweaks to how package_data is handled so that it searches additional file hierarchies.

There is another issue: package_data is handled in the build_py command, which happens before build_ext, which runs Cython. So even if you manage to handle the Cython-generated files using package_data, you either need to reorder the build subcommands or you need to run Cython in build_py.

But for coherency I would rather use the existing option that is specifically for this purpose

You mean: going back to use data_files?

@embray
Copy link
Contributor

embray commented Oct 5, 2016

comment:22

I see, that's a bit annoying. I'm surprised I haven't run into that before. If I have I don't know what solution I used--maybe just manually copying the generated Cython files to the appropriate build path. It would still not be a bad idea to add relevant entries to package_data though since that's also used by the sdist command to ensure that all those files are included in the source tarball (though this can also be achieved simply with the correct MANIFEST.in rule).

Looking at the source again, I think the simplest thing to do would be after Cythonization to copy the appropriate generated files to the appropriate build/ directory (basically emulating what build_py.build_package_data is doing, which you basically already have. But then also add entries for either of:

  1. `self.get_finalized_command('build_py').data_files or
  2. `self.distribution.data_files

The former is the same as if those files were included in package_data in the first place.

An alternative I kind of like, but that's more complicated, is to make cythonize and entirely separate command (build_cython or something) and insert that into the front of the build command's sub-commands. But that's more complicated and probably not worth it for now.

My only real problem at this point is with the new attribute you're tacking on. It seems easily avoidable.

@embray
Copy link
Contributor

embray commented Oct 5, 2016

comment:27

Hmmm...I think I can appreciate what the clean_stale_files stuff is for. In most projects I would just blow away the build directory, or just individual files in it if applicable. But sage is so large this is a real pain to do--that's something which itself could use a better approach but that's well beyond this ticket.

I don't think that should just happen when you "install" though, but really prior to any build.

@jdemeyer
Copy link
Author

jdemeyer commented Oct 5, 2016

comment:28

Replying to @embray:

Hmmm...I think I can appreciate what the clean_stale_files stuff is for.

This isn't part of this ticket, but #21604.

@mkoeppe
Copy link
Contributor

mkoeppe commented Oct 7, 2016

comment:29

I've tested it (on top of 7.4.beta6), and it seems to work for me.
Erik, objections to setting this ticket to "positive review"?

@mkoeppe
Copy link
Contributor

mkoeppe commented Oct 7, 2016

Reviewer: Matthias Koeppe

@jdemeyer
Copy link
Author

jdemeyer commented Oct 8, 2016

comment:30

Replying to @embray:

  1. Just because distutils is "meant to be" tinkered with in a flexible manner, monkeypatching the Distribution option is definitely an anti-pattern.

Adding attributes to a class is a completely normal thing when subclassing. In this case, I could make a trivial subclass

class SageDistribution(Distribution):
    pass

and define that SageDistribution supports a sage_extra_files attribute. Because of the way Python works though, there is no need for such a subclass.

So, I still don't get why it's perfectly fine to have dist.data_files (what distutils itself uses) but not dist.sage_extra_files. They serve very analogous purposes, so why not implement them analogously?

@embray
Copy link
Contributor

embray commented Oct 10, 2016

comment:31

Please--you don't have to explain to me how Python works.

The point I'm trying to make is that the way you're doing this is antithetical to the common programming patterns used specifically in the context of writing distutils commands. I'm still not convinced that you need this extra attribute at all. But for now I'd be satisfied if it were implemented as I described above. If that's not clear I'll just do it.

@jdemeyer
Copy link
Author

comment:32

Replying to @embray:

Please--you don't have to explain to me how Python works.

Of course I don't need to explain Python. I wanted to explain the philosophy behind what I was doing.

I'm still not convinced that you need this extra attribute at all.

And I'm still not convinced that what I'm doing is somehow wrong.

But for now I'd be satisfied if it were implemented as I described above.

I'll try it.

@embray
Copy link
Contributor

embray commented Oct 10, 2016

comment:33

BTW I should add--if what I suggested is not clear that's of course my fault for explaining poorly. My point is that it's not good practice in general to use classes as dumping grounds for arbitrary attributes. Python allows it, sure, but it's almost always confusing and error-prone. For example it's possible to run the "install" command without running "build_py" first, in which case you'll get an attribute error. There are myriad little corner cases like that in distutils.

I think instead of "sage_extra_files" it might be fine to call this "cythonized_files" or something like that, and have it be an attribute of the custom "build_ext". That way it's clearer exactly what this is about (and that the only reason it needs to exist at all is due to how we're putting the cythonized files in a separate, parallel directory structure).

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 10, 2016

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

dfcd817extra_files -> sage_build_ext.cythonized_files

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 10, 2016

Changed commit from 7e3a492 to dfcd817

@jdemeyer
Copy link
Author

comment:35

Needs review.

@embray
Copy link
Contributor

embray commented Oct 10, 2016

comment:36

This looks along the right lines. It occurs to me there's something kind of problematic here, which is that it runs run_cython in finalize_command. finalize_command is basically supposed to be like an __init__--it should just set default values for things and not actually have much in the way of side-effects.

run_cython() should happen at the beginning of the run(). finalize_command should just set cythonized_files to a default value (such as an empty list or None). I think I see why you did things the way you did--you wanted to make sure one way or another that find_extra_files was run. I might instead move the find_extra_files call into a method on build_ext (such as find_cythonized_files() or something). When called after run_cython() that could update the cythonized_files attribute, and then be used to copy the files. But from the install command you would just called find_cythonized_files() if cythonized_files isn't already set.

@embray
Copy link
Contributor

embray commented Oct 10, 2016

comment:37

Something like

class sage_build_ext(build_ext):
    def finalize_options(self):
        ...
        self.cythonized_files = None

    def run(self):
        self.run_cython()

        # The following two lines, or the equivalent thereof, could be called
        # directly from run_cython() too
        self.find_cythonized_files()
        self.copy_cythonized_files()
        build_ext.run(self)

    def find_cythonized_files(self):
        if self.cythonized_files is None:
            ...
            self.cythonized_files = ...


class sage_install(install):
    def clean_stale_files(self):
        ...
        cmd_build_ext = self.get_finalized_command('build_ext')
        cmd.find_cythonized_files()

        # Do stuff with cmd.find_cythonized_files()

And the rest pretty much exactly as you have it. Would that work?

It's not pretty but with distutils you have to write code like you're writing for Python 2.1 or something :)

@jdemeyer
Copy link
Author

comment:38

Replying to @embray:

This looks along the right lines. It occurs to me there's something kind of problematic here, which is that it runs run_cython in finalize_command.

That's exactly what upstream Cython 0.25 does (which was the inspiration for this ticket), so I'm not inclined to change this.

@mkoeppe
Copy link
Contributor

mkoeppe commented Oct 10, 2016

comment:39

Erik, on this ticket (and elsewhere) you have mentioned various subtle issues of setup.py commands and their interaction. I've opened ticket #21678 (Testsuite for src/setup.py) to formalize this, and to make sure that everything that is implemented correctly on the present ticket, according to your comments, will not be lost by regressions in the future.

@embray
Copy link
Contributor

embray commented Oct 11, 2016

comment:40

Then I'll have to bother upstream Cython about this. There may be a reason they're doing it that way, but it doesn't make a lot of sense on its face.

The point of finalize_command is to initialize any options / set up defaults, not actually do anything. "But they do it" is not a good reason if we don't know why they're doing it.

@embray
Copy link
Contributor

embray commented Oct 11, 2016

comment:41

If I had to guess, the reason they're doing it that way is because of how the "sdist" command uses the "build_ext" command to get a list of source files to include in the sdist. I think the right way to do that is not really to run cythonize() in the finalize_command() method, but rather to modify the sdist command so that it runs cythonize() as a sub-command.

@jdemeyer
Copy link
Author

comment:42

Replying to @embray:

Then I'll have to bother upstream Cython about this.

Please do that.

@embray
Copy link
Contributor

embray commented Oct 11, 2016

comment:43

Replying to @jdemeyer:

Replying to @embray:

Then I'll have to bother upstream Cython about this.

Please do that.

I will try to work with them more closely on that.

In the interest of harmony and in making progress, if this is working for you as is, then I give it a positive review. I'll open a separate ticket with my ideas for how to build on this to make it a bit cleaner, but I don't think it's bad as is as long as it works. I just know we can do better.

@jdemeyer
Copy link
Author

Changed reviewer from Matthias Koeppe to Matthias Koeppe, Erik Bray

@jdemeyer
Copy link
Author

comment:44

I'll let the other reviewer (Matthias Koeppe) decide on giving positive review.

@embray
Copy link
Contributor

embray commented Oct 11, 2016

comment:45

See #21682

@mkoeppe
Copy link
Contributor

mkoeppe commented Oct 11, 2016

comment:46

Tested on top of 7.4.rc0.

@vbraun
Copy link
Member

vbraun commented Oct 21, 2016

Changed branch from u/jdemeyer/ticket/21600 to dfcd817

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants