Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incompatibility with py3 in autogen/pari #22044

Closed
fchapoton opened this issue Dec 11, 2016 · 27 comments
Closed

fix incompatibility with py3 in autogen/pari #22044

fchapoton opened this issue Dec 11, 2016 · 27 comments

Comments

@fchapoton
Copy link
Contributor

caused by #21613

CC: @embray @mkoeppe @jdemeyer

Component: python3

Author: Frédéric Chapoton

Branch: 02c5061

Reviewer: Jeroen Demeyer

Issue created by migration from https://trac.sagemath.org/ticket/22044

@fchapoton fchapoton added this to the sage-7.5 milestone Dec 11, 2016
@fchapoton
Copy link
Contributor Author

Branch: u/chapoton/22044

@fchapoton
Copy link
Contributor Author

Commit: 02c5061

@fchapoton
Copy link
Contributor Author

New commits:

02c5061py3 fixing src/sage_setup/autogen/pari/__init__.py

@jdemeyer
Copy link

Reviewer: Jeroen Demeyer

@embray

This comment has been minimized.

@embray
Copy link
Contributor

embray commented Dec 12, 2016

comment:3

This does't look right to me. Instead you should change pari_share to return a string instead of bytes (by decoding the output with the system's filesystem encoding): https://github.com/sagemath/sagetrac-mirror/blob/develop/src/sage_setup/autogen/pari/parser.py?id=02c50619cbb13a5c8e32a604e2e55642f5fbad4e#n41

@embray

This comment has been minimized.

@fchapoton
Copy link
Contributor Author

comment:5

Please take care of that in whatever way you want.

@jdemeyer
Copy link

comment:6

Replying to @embray:

Instead you should change pari_share to return a string instead of bytes

Why introduce an extra decoding/encoding step? The native format for filenames is bytes since that is what the system calls accept. If Python has to convert the filename to bytes anyway, why not just feed it bytes?

@embray
Copy link
Contributor

embray commented Dec 12, 2016

comment:8

Well for one, the patch as it stands is creating an inhomogeneous list containing bytes and str object. Of course, you can do that, but it's a little surprising and confusing. Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

In fact there is no "native format for filenames" in Python in part because it's more complicated than that especially when you consider Windows--there is no one right answer to that question. Which is why all the stdlib functions which take a filesystem path accept both bytes and unicode.

In any case, the general pattern in Python 3 with functions at the system boundary is to immediately convert from bytes to text, use text, and the convert back to bytes when going back out to the system (which in many cases, like opening files, happens transparently in Python). The only exception is when those bytes will never, ever be used in another context, like working with file and network protocols directly. The extra encoding/decoding steps are trivial otherwise.

@embray
Copy link
Contributor

embray commented Dec 12, 2016

comment:9

FWIW an easy enough workaround is to pass universal_newlines=True to the subprocess.Popen. On Python 2 this doesn't make an enormous difference, especially in Sage since it's not likely going to be calling any programs that output CRLFs. But on Python 3 it does make a difference since it wraps the streams in TextIOWrappers, which decode bytes to text, by default using locale.getpreferredencoding() which is generally appropriate.

@jdemeyer
Copy link

comment:10

Replying to @embray:

Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

I don't get what the problem is with treating filenames as bytes everywhere. It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

FWIW an easy enough workaround[...]

When you use bytes, you don't need a workaround at all.

@embray
Copy link
Contributor

embray commented Dec 12, 2016

comment:11

It's not really a "workaround"--poor choice of words. What I mean is it's an easy way to go from binary to text in that context.

It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

That's where you're wrong. The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted. This compounded when you have bytes coming from multiple sources, in possibly different encodings, and you try to combine them while ignoring where they came from. Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

You're not in poor company having doubts about this. Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ I don't agree with him (and for some reasons he doesn't give in his list of possible objections), but he's not without a point, and in fact has since pushed for many useful updates to Python 3 to make dealing with bytes and strings a bit less of a hassle.

@embray
Copy link
Contributor

embray commented Dec 12, 2016

comment:12

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Although that may or may not apply in this specific case, it's just bad practice in general for reasons like this.

@jdemeyer
Copy link

comment:13

Replying to @embray:

The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted.

Easy: as a char * in C. I don't see how one could get this wrong. With str of course, it is possible to get the encoding wrong.

This compounded when you have bytes coming from multiple sources, in possibly different encodings

This is not the case here.

Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries

That's a general discussion about using unicode in Python. I am specifically talking about filenames

and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

I see. If I understand things correctly, using bytes for filenames is currently problematic in Windows, but that would be fixed by that PEP. So it's still not a problem in the end.

Replying to @embray:

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Right. But I would argue to just use bytes everywhere.

@jdemeyer
Copy link

comment:14

Replying to @embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

@vbraun
Copy link
Member

vbraun commented Dec 13, 2016

Changed branch from u/chapoton/22044 to 02c5061

@vbraun
Copy link
Member

vbraun commented Dec 14, 2016

Changed commit from 02c5061 to none

@vbraun
Copy link
Member

vbraun commented Dec 14, 2016

comment:16

Filenames are byte strings on Linux. I think there is a Windows issue somewhere here but hopefully thats ok if the path is plain ascii. Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

@embray
Copy link
Contributor

embray commented Dec 14, 2016

comment:17

Sorry, but no, this is absolutely wrong. Why are you mixing bytes and str objects in a single list in application code?

@embray embray reopened this Dec 14, 2016
@embray
Copy link
Contributor

embray commented Dec 14, 2016

comment:18

It is absolutely, 100% against the philosophy and design of Python 3 to be passing around bytes objects for filenames, or much anything else that isn't meant to be interacted with as binary data. As soon as a user tries to join a (unicode) string to it all hell breaks loose. In this specific example it is mostly irrelevant, but you're basically arguing that all filenames should be passed around in Sage as bytes objects which:

a) You haven't implemented yet--either be 100% consistent or go home.  To be clear, this also means that you have to fight a constant uphill battle with every Python API that returns  `str`, not `bytes` for filenames, such as `os.listdir`, `os.getcwd`, `os.curdir`, `os.sep`, and many others.  Some of these are functions that when passed bytes return bytes (which means if you want bytes you have to make sure they are *always* passed bytes regardless what some users pastes into their terminal).  In other cases they are not.

b) Is, as previously stated, not the right thing to do and is user- and developer-hostile.

You're confusing what filenames are as internally represented by the OS, and what they actually represent, semantically. If I'm a human being, and my home directory is "C:\Users\риго́рий Перельма́н", then my home directory is "C:\Users\риго́рий Перельма́н", not b'C:\\Users\\\xd1\x80\xd0\xb8\xd0\xb3\xd0\xbe\xcc\x81\xd1\x80\xd0\xb8\xd0\xb9 \xd0\x9f\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbb\xd1\x8c\xd0\xbc\xd0\xb0\xcc\x81\xd0\xbd\x00', or worse b'\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00@\x048\x043\x04>\x04\x01\x03@\x048\x049\x04 \x00\x1f\x045\x04@\x045\x04;\x04L\x04<\x040\x04\x01\x03=\x04'. That's how the computer (maybe--depending on the system I'm on or how network filesystems are mounted, etc.) sees it, but that's not what it is. Decoding from bytes to unicode codepoints and operating solely on those until it has to go back to the OS the only consistent and safe way to treat such data.

I speak from hard-fought experience here having helped lead the Python 3 porting effort of a large, 10-15 year old codebase, which interacts with legacy ASCII-based applications and file formats, and is used by international users across platforms and had to have backwards compatibility with users' assumptions that strings (i.e., when they type "abc") are character strings, not opaque blobs of bytes.

@embray
Copy link
Contributor

embray commented Dec 14, 2016

comment:19

Replying to @jdemeyer:

Replying to @embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

You can agree with Armin, and you wouldn't be altogether wrong. But where you are wrong is trying to defy the design and intent of Python 3. Armin brought up these issues in part to influence how Python 3 moved forward in treating these issue more sanely to people like him who frequently work on boundary code. However, nowhere does he argue that Python 3 should just be used incorrectly if you don't have to.

@embray
Copy link
Contributor

embray commented Dec 14, 2016

comment:20

Replying to @jdemeyer:

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right. os.environb was introduced to read the environment directly as bytes.

@jdemeyer
Copy link

comment:22

Replying to @embray:

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right.

The "most of the time" is going to bite you someday. bytes are never wrong.

@jdemeyer
Copy link

comment:23

Whether you or me are right or wrong, you should never re-open a ticket that the release manager has closed. Feel free to continue the discussion on a new ticket or to complain directly to the release manager to revert this.

@embray
Copy link
Contributor

embray commented Dec 15, 2016

comment:24

That's fair.

@embray
Copy link
Contributor

embray commented Dec 15, 2016

comment:25

Replying to @vbraun:

Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

This doesn't make any sense. In Python filenames are neither "bytes" or "str". Python does not have a "filename" type (though it has been argued for in the past, and in fact there's a relevant PEP being drafted somewhere, though I think instead they went with yet another magic method of some kind :)

If you're writing code that is solely interfacing with POSIX interfaces then yes, filenames are just collections of bytes and can stay in bytes form. But outside that narrow context you have to think about filenames in the abstract, which more-often-than-not a text string of human-readable glyphs (I think this is one thing Windows got right, though the POSIX approach has its advantages as well). Filenames are a user-interface. That's why they're not even stored in inodes, but rather in dirents as a convenient way for humans (and programs written and used by humans) to locate files. While it's true in POSIX a filename can be any arbitrary sequence of bytes, and the kernel is agnostic to such sordid details as character encodings, that's an implementation choice.

In Python, although there are wrappers around POSIX interfaces, you're writing Python not POSIX, and there text is treated as text (especially in Python 3) and in most cases filenames too are text. In order wedge the POSIX notion of "filenames are char arrays" the surrogateescape decoder allows round-tripping arbitrary bytes that can't be decoded to unicode codepoints. The need for this in real filenames is rare--it's more commonly applicable in other contexts where the assumed encoding is not correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants