fix incompatibility with py3 in autogen/pari #22044

fchapoton · 2016-12-11T09:52:34Z

caused by #21613

CC: @embray @mkoeppe @jdemeyer

Component: python3

Author: Frédéric Chapoton

Branch: 02c5061

Reviewer: Jeroen Demeyer

Issue created by migration from https://trac.sagemath.org/ticket/22044

The text was updated successfully, but these errors were encountered:

fchapoton · 2016-12-11T09:53:20Z

Branch: u/chapoton/22044

fchapoton · 2016-12-11T09:53:20Z

Commit: 02c5061

fchapoton · 2016-12-11T09:53:20Z

New commits:

`02c5061`	`py3 fixing src/sage_setup/autogen/pari/__init__.py`

jdemeyer · 2016-12-11T12:51:20Z

Reviewer: Jeroen Demeyer

embray · 2016-12-12T10:32:09Z

comment:3

This does't look right to me. Instead you should change pari_share to return a string instead of bytes (by decoding the output with the system's filesystem encoding): https://github.com/sagemath/sagetrac-mirror/blob/develop/src/sage_setup/autogen/pari/parser.py?id=02c50619cbb13a5c8e32a604e2e55642f5fbad4e#n41

fchapoton · 2016-12-12T10:35:12Z

comment:5

Please take care of that in whatever way you want.

jdemeyer · 2016-12-12T12:19:40Z

comment:6

Replying to @embray:

Instead you should change pari_share to return a string instead of bytes

Why introduce an extra decoding/encoding step? The native format for filenames is bytes since that is what the system calls accept. If Python has to convert the filename to bytes anyway, why not just feed it bytes?

embray · 2016-12-12T13:10:29Z

comment:8

Well for one, the patch as it stands is creating an inhomogeneous list containing bytes and str object. Of course, you can do that, but it's a little surprising and confusing. Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

In fact there is no "native format for filenames" in Python in part because it's more complicated than that especially when you consider Windows--there is no one right answer to that question. Which is why all the stdlib functions which take a filesystem path accept both bytes and unicode.

In any case, the general pattern in Python 3 with functions at the system boundary is to immediately convert from bytes to text, use text, and the convert back to bytes when going back out to the system (which in many cases, like opening files, happens transparently in Python). The only exception is when those bytes will never, ever be used in another context, like working with file and network protocols directly. The extra encoding/decoding steps are trivial otherwise.

embray · 2016-12-12T13:25:19Z

comment:9

FWIW an easy enough workaround is to pass universal_newlines=True to the subprocess.Popen. On Python 2 this doesn't make an enormous difference, especially in Sage since it's not likely going to be calling any programs that output CRLFs. But on Python 3 it does make a difference since it wraps the streams in TextIOWrappers, which decode bytes to text, by default using locale.getpreferredencoding() which is generally appropriate.

jdemeyer · 2016-12-12T14:02:18Z

comment:10

Replying to @embray:

Either treat filenames as byte strings everywhere, or nowhere (but I suggest nowhere because it is user- and developer-hostile to have to look at bytes objects containing non-ASCII character data :) In fact it's more common to deal with filenames as str objects in Python for exactly this reason, except for right at the system boundary, or when dealing directly with binary data.

I don't get what the problem is with treating filenames as bytes everywhere. It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

FWIW an easy enough workaround[...]

When you use bytes, you don't need a workaround at all.

embray · 2016-12-12T16:23:43Z

comment:11

It's not really a "workaround"--poor choice of words. What I mean is it's an easy way to go from binary to text in that context.

It seems to me that things are least error-prone if you avoid conversions bytes <-> str.

That's where you're wrong. The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted. This compounded when you have bytes coming from multiple sources, in possibly different encodings, and you try to combine them while ignoring where they came from. Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

You're not in poor company having doubts about this. Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ I don't agree with him (and for some reasons he doesn't give in his list of possible objections), but he's not without a point, and in fact has since pushed for many useful updates to Python 3 to make dealing with bytes and strings a bit less of a hassle.

embray · 2016-12-12T16:41:08Z

comment:12

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Although that may or may not apply in this specific case, it's just bad practice in general for reasons like this.

jdemeyer · 2016-12-13T20:36:06Z

comment:13

Replying to @embray:

The further away bytes get from their original source, the harder it is to know where they came from or how they should be interpreted.

Easy: as a char * in C. I don't see how one could get this wrong. With str of course, it is possible to get the encoding wrong.

This compounded when you have bytes coming from multiple sources, in possibly different encodings

This is not the case here.

Many examples of where this can go wrong are discussed here: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#why-not-just-assume-utf-8-and-avoid-having-to-decode-at-system-boundaries

That's a general discussion about using unicode in Python. I am specifically talking about filenames

and even in the rather recent PEP 529: https://www.python.org/dev/peps/pep-0529/

I see. If I understand things correctly, using bytes for filenames is currently problematic in Windows, but that would be fixed by that PEP. So it's still not a problem in the end.

Replying to @embray:

Or if you want just a more practical argument, going back to my point that this is mixing bytes and strings in a single list, this easily results in something like:

>>> paths = [b'/usr/lib', '/usr/local/lib']
>>> filename = [os.path.join(p, 'libfoo.so') for p in paths]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/usr/lib/python3.4/posixpath.py", line 89, in join
    "components") from None
TypeError: Can't mix strings and bytes in path components

Right. But I would argue to just use bytes everywhere.

jdemeyer · 2016-12-13T21:36:23Z

comment:14

Replying to @embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

vbraun · 2016-12-13T23:55:36Z

Changed branch from u/chapoton/22044 to 02c5061

vbraun · 2016-12-14T00:15:14Z

Changed commit from 02c5061 to none

vbraun · 2016-12-14T00:15:14Z

comment:16

Filenames are byte strings on Linux. I think there is a Windows issue somewhere here but hopefully thats ok if the path is plain ascii. Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

embray · 2016-12-14T12:06:10Z

comment:17

Sorry, but no, this is absolutely wrong. Why are you mixing bytes and str objects in a single list in application code?

embray · 2016-12-14T12:08:49Z

comment:18

It is absolutely, 100% against the philosophy and design of Python 3 to be passing around bytes objects for filenames, or much anything else that isn't meant to be interacted with as binary data. As soon as a user tries to join a (unicode) string to it all hell breaks loose. In this specific example it is mostly irrelevant, but you're basically arguing that all filenames should be passed around in Sage as bytes objects which:

a) You haven't implemented yet--either be 100% consistent or go home.  To be clear, this also means that you have to fight a constant uphill battle with every Python API that returns  `str`, not `bytes` for filenames, such as `os.listdir`, `os.getcwd`, `os.curdir`, `os.sep`, and many others.  Some of these are functions that when passed bytes return bytes (which means if you want bytes you have to make sure they are *always* passed bytes regardless what some users pastes into their terminal).  In other cases they are not.

b) Is, as previously stated, not the right thing to do and is user- and developer-hostile.

You're confusing what filenames are as internally represented by the OS, and what they actually represent, semantically. If I'm a human being, and my home directory is "C:\Users\риго́рий Перельма́н", then my home directory is "C:\Users\риго́рий Перельма́н", not b'C:\\Users\\\xd1\x80\xd0\xb8\xd0\xb3\xd0\xbe\xcc\x81\xd1\x80\xd0\xb8\xd0\xb9 \xd0\x9f\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbb\xd1\x8c\xd0\xbc\xd0\xb0\xcc\x81\xd0\xbd\x00', or worse b'\xff\xfeC\x00:\x00\\\x00U\x00s\x00e\x00r\x00s\x00\\\x00@\x048\x043\x04>\x04\x01\x03@\x048\x049\x04 \x00\x1f\x045\x04@\x045\x04;\x04L\x04<\x040\x04\x01\x03=\x04'. That's how the computer (maybe--depending on the system I'm on or how network filesystems are mounted, etc.) sees it, but that's not what it is. Decoding from bytes to unicode codepoints and operating solely on those until it has to go back to the OS the only consistent and safe way to treat such data.

I speak from hard-fought experience here having helped lead the Python 3 porting effort of a large, 10-15 year old codebase, which interacts with legacy ASCII-based applications and file formats, and is used by international users across platforms and had to have backwards compatibility with users' assumptions that strings (i.e., when they type "abc") are character strings, not opaque blobs of bytes.

embray · 2016-12-14T12:11:08Z

comment:19

Replying to @jdemeyer:

Replying to @embray:

Armin Ronacher has written extensively about it, such as here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I just read this and I agree almost with everything he says... Python 3 insists too much on unicode.

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

You can agree with Armin, and you wouldn't be altogether wrong. But where you are wrong is trying to defy the design and intent of Python 3. Armin brought up these issues in part to influence how Python 3 moved forward in treating these issue more sanely to people like him who frequently work on boundary code. However, nowhere does he argue that Python 3 should just be used incorrectly if you don't have to.

embray · 2016-12-14T12:24:27Z

comment:20

Replying to @jdemeyer:

One particular pet peeve of mine not mentioned in that essay is os.environ. At the system level, that's just char * but Python 3 insists on interpreting these byte strings as unicode somehow.

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right. os.environb was introduced to read the environment directly as bytes.

jdemeyer · 2016-12-14T15:16:28Z

comment:22

Replying to @embray:

It's not "somehow"--it's using the sys.getfilesystemencoding() with surrogateescape handling, which most of the time is right.

The "most of the time" is going to bite you someday. bytes are never wrong.

jdemeyer · 2016-12-14T15:20:11Z

comment:23

Whether you or me are right or wrong, you should never re-open a ticket that the release manager has closed. Feel free to continue the discussion on a new ticket or to complain directly to the release manager to revert this.

embray · 2016-12-15T11:40:17Z

comment:24

That's fair.

embray · 2016-12-15T11:56:23Z

comment:25

Replying to @vbraun:

Python is moving towards byte string filenames on Windows, too, to facilitate posix compatibility.

This doesn't make any sense. In Python filenames are neither "bytes" or "str". Python does not have a "filename" type (though it has been argued for in the past, and in fact there's a relevant PEP being drafted somewhere, though I think instead they went with yet another magic method of some kind :)

If you're writing code that is solely interfacing with POSIX interfaces then yes, filenames are just collections of bytes and can stay in bytes form. But outside that narrow context you have to think about filenames in the abstract, which more-often-than-not a text string of human-readable glyphs (I think this is one thing Windows got right, though the POSIX approach has its advantages as well). Filenames are a user-interface. That's why they're not even stored in inodes, but rather in dirents as a convenient way for humans (and programs written and used by humans) to locate files. While it's true in POSIX a filename can be any arbitrary sequence of bytes, and the kernel is agnostic to such sordid details as character encodings, that's an implementation choice.

In Python, although there are wrappers around POSIX interfaces, you're writing Python not POSIX, and there text is treated as text (especially in Python 3) and in most cases filenames too are text. In order wedge the POSIX notion of "filenames are char arrays" the surrogateescape decoder allows round-tripping arbitrary bytes that can't be decoded to unicode codepoints. The need for this in real filenames is rare--it's more commonly applicable in other contexts where the assumed encoding is not correct.

fchapoton added this to the sage-7.5 milestone Dec 11, 2016

fchapoton added c: python3 labels Dec 11, 2016

fchapoton added the s: needs review label Dec 11, 2016

jdemeyer added s: positive review and removed s: needs review labels Dec 11, 2016

This comment has been minimized.

Sign in to view

embray added s: needs work and removed s: positive review labels Dec 12, 2016

This comment has been minimized.

Sign in to view

jdemeyer added s: needs review and removed s: needs work labels Dec 12, 2016

vbraun removed the s: needs review label Dec 13, 2016

vbraun closed this as completed in 58d6a03 Dec 13, 2016

embray reopened this Dec 14, 2016

jdemeyer closed this as completed Dec 14, 2016

embray mentioned this issue Mar 31, 2017

Don't mix bytes and strings for pathnames #22108

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix incompatibility with py3 in autogen/pari #22044

fix incompatibility with py3 in autogen/pari #22044

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

jdemeyer commented Dec 11, 2016

This comment has been minimized.

embray commented Dec 12, 2016

This comment has been minimized.

fchapoton commented Dec 12, 2016

jdemeyer commented Dec 12, 2016

embray commented Dec 12, 2016

embray commented Dec 12, 2016

jdemeyer commented Dec 12, 2016

embray commented Dec 12, 2016

embray commented Dec 12, 2016

jdemeyer commented Dec 13, 2016

jdemeyer commented Dec 13, 2016

vbraun commented Dec 13, 2016

vbraun commented Dec 14, 2016

vbraun commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

jdemeyer commented Dec 14, 2016

jdemeyer commented Dec 14, 2016

embray commented Dec 15, 2016

embray commented Dec 15, 2016

fix incompatibility with py3 in autogen/pari #22044

fix incompatibility with py3 in autogen/pari #22044

Comments

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

fchapoton commented Dec 11, 2016

jdemeyer commented Dec 11, 2016

This comment has been minimized.

embray commented Dec 12, 2016

This comment has been minimized.

fchapoton commented Dec 12, 2016

jdemeyer commented Dec 12, 2016

embray commented Dec 12, 2016

embray commented Dec 12, 2016

jdemeyer commented Dec 12, 2016

embray commented Dec 12, 2016

embray commented Dec 12, 2016

jdemeyer commented Dec 13, 2016

jdemeyer commented Dec 13, 2016

vbraun commented Dec 13, 2016

vbraun commented Dec 14, 2016

vbraun commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

embray commented Dec 14, 2016

jdemeyer commented Dec 14, 2016

jdemeyer commented Dec 14, 2016

embray commented Dec 15, 2016

embray commented Dec 15, 2016