py3: simplified string conversion utilities #24222

embray · 2017-11-16T11:38:13Z

A possible alternative to #24186, implementing simple conversion from C char arrays or bytes objects to str objects, and of str objects to bytes objects. Here "str" and "bytes" are to be read exactly for either Python 2 or Python 3, so on Python 2 this means no conversion is performed since str is bytes == True.

One thing this does not do is implement any kind of conversion from Python 2 unicode objects to bytes. This functionality might be worth adding, in some form, to str_to_bytes. But this would add a new feature on Python 2, whereas for now I'm only trying to preserve the existing functionality on Python 2 exactly, while transparently supporting Python 3 strs everywhere that Python 2 strs are supported.

CC: @fchapoton @jdemeyer

Component: python3

Author: Erik Bray, Jeroen Demeyer

Branch/Commit: dec9f3a

Reviewer: Jeroen Demeyer, Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/24222

The text was updated successfully, but these errors were encountered:

sagetrac-git · 2017-11-16T11:38:42Z

Branch pushed to git repo; I updated commit sha1. New commits:

`4c78743`	`Add new string conversion utilities.`

sagetrac-git · 2017-11-16T11:38:42Z

Commit: 4c78743

jdemeyer · 2017-11-16T11:59:21Z

comment:3

OK, so now we have to discuss encodings... first of all, I really don't like the encoding argument in the functions, that will just slow everything down. Better pick one (or a few) good encoding(s) and use those.

I don't see why you would pick locale.getpreferredencoding() as default encoding. Relevant encodings should be:

ascii. In many cases we know that output from a C library is an ASCII string (for example, the string representation of a number). Luckily, ASCII is compatible with most other encodings, so this isn't an issue.
filesystemencoding. This is the encoding that Python internally uses for most conversions char* <-> str. For this reason, I think that this is the most important use case and it should be the default when in doubt.
utf-8. This is good if you just want to store unicode as char* (say, an internal user-specified name for something) and you don't care about the exact encoding but you do care that round-trip conversion works properly.
locale.getpreferredencoding(). I believe that this is mainly relevant for subprocesses, provided that those subprocesses also look at the encoding.

For efficiency, I would rather have several sets of functions, each with a hard-coded encoding.

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file. That way, they will actually be inlined when calling them from an external Cython module.

jdemeyer · 2017-11-16T12:03:23Z

comment:4

You don't need six.PY2. You can just replace

if six.PY2:
    from cpython.string cimport PyString_FromString
else:
    from cpython.bytes cimport PyBytes_FromString as PyString_FromString

by

from cpython.bytes cimport PyBytes_FromString as PyString_FromString

You could also just use PyBytes_FromString in the code.

embray · 2017-11-16T12:31:52Z

comment:5

Replying to @jdemeyer:

You don't need six.PY2. You can just replace

That's what I thought originally too, but I was having some problems trying to do it as a one-liner. I'll try your suggestion though and double-check whether it works.

embray · 2017-11-16T12:48:23Z

comment:6

I don't think the "encoding" argument slows anything down by any significant amount, especially in cases where it isn't used. I'd be amenable to encoding-specific functions for some cases, as Python has those as well in its API. But right now I'm really trying to avoid breadth of API surface. It is good to have generic versions of these functions that accept any encoding, and this is the bare minimum needed to get Python 3 support off the ground. I think that later we can encoding-specific functions where specific use cases for them can be demonstrated. I don't want to get into the weeds with this right now.

As for the default encoding, locale.getpreferredencoding() makes more sense. sys.getfilesystemencoding() is specifically about guessing (more or less) what encoding the filesystem specifically is using for encoding e.g. of filenames, and has nothing to do with encodings of file contents or of arbitrary strings output by software. (Well, that's not exactly true since on *nix the two are essentially the same, although the former can change if the process's locale is changed, I think).

So on *NIX platforms like we care about the most it's actually mostly a moot point. But a lot of the software Sage interfaces with is locale-aware, so it's better to take that into account as the default than not.

embray · 2017-11-16T12:50:35Z

comment:7

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file.

I was wondering about that. But in that case the .pyx file will be empty (does it even need to exist at all)? What about the default encoding variables?

jdemeyer · 2017-11-16T12:57:39Z

comment:8

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

jdemeyer · 2017-11-16T13:40:43Z

comment:9

Replying to @embray:

I don't think the "encoding" argument slows anything down by any significant amount

The Python 3 C API has specific functions to encode/decode for certainly particular encodings:

I haven't profiled, but I would hope that these are faster than the generic API.

jdemeyer · 2017-11-16T13:42:05Z

comment:10

Replying to @embray:

But in that case the .pyx file will be empty (does it even need to exist at all)?

It doesn't need to exist.

What about the default encoding variables?

I propose to remove those variables anyway. If you do need them for some reason, they must be in the .pyx file.

jdemeyer · 2017-11-16T13:51:34Z

comment:11

Two more things:

it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

if (0)
    nonexisting_function()

when nonexisting_function() doesn't actually exist like PyUnicode_AsUTF8 on Python 2. The only reason that the above code compiles is compiler optimization.

See also #24215 for Cython-compile-time constants in general.

Could you move this to sage/cpython/string.pxd? I recently created the sage/cpython directory to deal with low-level Python internals and I think that these functions would fit there.

fchapoton · 2017-11-16T15:16:15Z

comment:12

fails to build, see patchbot report:

from string import Template

fchapoton · 2017-11-16T15:20:40Z

comment:13

oh, I see. Name conflict between the global "string" module and the local "string" module..

embray · 2017-11-17T14:22:11Z

comment:14

Replying to @jdemeyer:

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

That's far too onerous and anxiety-provoking for the user :) Python itself uses default encodings all over the place when in doubt (hence e.g. PyUnicode_DecodeFSDefault). It's an unfortunate fact that there's isn't always one "right answer" here; the best we can do is provide sensible defaults and the ability for user-specified encoding where applicable; that is, where we know we want a specific encoding. Moving forward we can also do more, for example, to ensure that any locale-aware code run by Sage is handled well.

I could definitely agree to adding more encoding-specific helper functions, especially for ASCII and UTF-8. But as a first pass, for the sake of getting Python 3 support off the ground, I'd prefer to leave this as is and then make adjustments as specific use cases arise. It will be difficult to even find those specific use cases until and unless we get further along on getting Python 3 working in general (with these functions, plus a few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so that will help expose a lot of interesting cases quickly).

I'll look at the rest of your suggestions; they seem reasonable.

embray · 2017-11-17T14:22:49Z

comment:15

Replying to @jdemeyer:

Two more things:

it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

embray · 2017-11-17T14:26:22Z

comment:16

Replying to @embray:

Replying to @jdemeyer:

Two more things:

it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

Nevermind; I see now that we explicitly pass that in to cythonize in the setup.py. For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

fchapoton · 2017-11-17T14:57:07Z

comment:17

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

embray · 2017-11-17T15:09:30Z

comment:18

Replying to @fchapoton:

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

I had to get it working, in part, so that I could run the doctests for this module :)

I think it will help things go much faster.

embray · 2017-11-17T15:33:05Z

comment:19

I'm seeing now how wanting to have module-level global variables in conjunction with inline c(p)def functions in Cython gets problematic. It just compiles them as __Pyx_GetModuleGlobalName(name). I feel like this should be a bug (I think not too difficult to fix?) in Cython, since it's really counter to how Python should work in this case.

For functions inlined from another module--at least if those functions access global variables from their original module--it should import that module during module initialization and use the correct module dict for globals lookups (just as normal Python functions do, basically).

That's an issue beyond this one though, so I'll rework things for now to get rid use of the global variables by these functions (I still want to have default encodings though).

embray · 2017-11-17T16:37:48Z

comment:20

So it turns out PyUnicode_AsEncodedString and PyUnicode_Decode do allow the encoding argument to be NULL, in which case they default to UTF-8. So if it's a good enough default for Python maybe it's a good enough default for us, for now.

Perhaps I'll just stick with that functionality, and take care to use things like locale.getpreferredencoding() when necessary, such as interfacing with external software. To that end it would be good to come up with some list of exactly what software Sage interfaces with is locale aware. Certainly ECL and GAP are strong candidates; probably others. It will be good to add some tests to that effect but we can worry about it when we come to it.

jdemeyer · 2017-11-17T16:56:57Z

comment:21

Replying to @embray:

Python itself uses default encodings all over the place when in doubt

Right, but there are several defaults (each of UTF-8, sys.getfilesystemencoding() and locale.getpreferredencoding() could be considered as defaults). And I don't think that any of those 3 has the large majority. So I still feel that forcing the user to pick one is the best solution.

jdemeyer · 2017-11-17T17:03:03Z

comment:22

Also, I feel that error handling should be different for the different cases: if you are communicating with locale-aware software using locale.getpreferredencoding() you probably want decoding errors to be actual errors. With the file system encoding on the other hand, surrogateescape is typically a better default.

jdemeyer · 2017-11-17T17:14:59Z

comment:23

Replying to @embray:

For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

No problem for me, although I would prefer PY_MAJOR_VERSION then. If you want to do that, please make a new ticket and make it depend on #24215 please.

embray · 2017-12-15T10:37:59Z

comment:84

Fine but it really doesn't make much difference.

sagetrac-git · 2017-12-15T11:14:10Z

Changed commit from 2bb8396 to 6d73a41

sagetrac-git · 2017-12-15T11:14:10Z

Branch pushed to git repo; I updated commit sha1. New commits:

`6d73a41`	`Use more specific type declarations here.`

jdemeyer · 2017-12-15T12:32:24Z

comment:87

Good for me if it passes testing.

fchapoton · 2017-12-15T15:16:02Z

comment:88

one green bot. I am setting to positive this important ticket.

vbraun · 2017-12-17T10:51:10Z

comment:89

Merge conflict

fchapoton · 2017-12-20T09:13:04Z

comment:90

Volker, do you by chance have any idea about what was the conflicting ticket ?

embray · 2017-12-21T11:12:35Z

comment:91

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

sagetrac-git · 2017-12-21T11:17:32Z

Changed commit from 6d73a41 to dec9f3a

sagetrac-git · 2017-12-21T11:17:32Z

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

`a285af4`	`Add new string conversion utilities.`
`5b3719a`	`Add support for optional error handling.`
`bf5f893`	`Make type checks explicit`
`1bd27fc`	`Require a const char*--this is more consistent for use in cases where`
`bd2d931`	`Use a less unusual character in the tests`
`b32f000`	`Use r-strings for the docstrings since they contain slashes`
`ccf7bcd`	`Clean up these declarations:`
`dec9f3a`	`Use more specific type declarations here.`

embray · 2017-12-21T11:17:52Z

comment:93

Rebased on current develop branch if it helps, but there was no merge conflict.

New commits:

`a285af4`	`Add new string conversion utilities.`
`5b3719a`	`Add support for optional error handling.`
`bf5f893`	`Make type checks explicit`
`1bd27fc`	`Require a const char*--this is more consistent for use in cases where`
`bd2d931`	`Use a less unusual character in the tests`
`b32f000`	`Use r-strings for the docstrings since they contain slashes`
`ccf7bcd`	`Clean up these declarations:`
`dec9f3a`	`Use more specific type declarations here.`

jdemeyer · 2017-12-21T12:14:48Z

comment:94

Replying to @embray:

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

It's not secret: https://github.com/vbraun/sage/tree/develop

jdemeyer · 2017-12-25T02:06:12Z

Changed dependencies from #24246 to none

jdemeyer · 2017-12-25T02:06:12Z

comment:95

I don't see any conflict...

vbraun · 2017-12-26T09:25:41Z

Changed branch from u/embray/python3/string-conversions to dec9f3a

embray added this to the sage-8.1 milestone Nov 16, 2017

embray added c: python3 labels Nov 16, 2017

embray added the s: needs review label Nov 16, 2017

jdemeyer added s: needs info and removed s: needs review labels Nov 16, 2017

jdemeyer added the s: needs work label Dec 14, 2017

fchapoton added s: needs review and removed s: needs work labels Dec 15, 2017

fchapoton added s: positive review and removed s: needs review labels Dec 15, 2017

vbraun added s: needs work and removed s: positive review labels Dec 17, 2017

jdemeyer added s: positive review and removed s: needs work labels Dec 25, 2017

vbraun removed the s: positive review label Dec 26, 2017

vbraun closed this as completed in bbd5bdb Dec 26, 2017

rwst mentioned this issue Feb 12, 2018

Py3: TypeError from Cython string conversion #23648

Closed

embray mentioned this issue Feb 2, 2018

py3: several string conversion fixes #24223

Closed

jdemeyer mentioned this issue Dec 11, 2017

Absolute import when importing string #24242

Closed

embray mentioned this issue Dec 13, 2017

Add a Cython compile time constant for PY_MAJOR_VERSION #24246

Closed

py3: simplified string conversion utilities #24222

py3: simplified string conversion utilities #24222

Comments

embray commented Nov 16, 2017

sagetrac-git mannequin commented Nov 16, 2017

sagetrac-git mannequin commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

embray commented Nov 16, 2017

embray commented Nov 16, 2017

embray commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

jdemeyer commented Nov 16, 2017

fchapoton commented Nov 16, 2017

fchapoton commented Nov 16, 2017

embray commented Nov 17, 2017

embray commented Nov 17, 2017

embray commented Nov 17, 2017

fchapoton commented Nov 17, 2017

embray commented Nov 17, 2017

embray commented Nov 17, 2017

embray commented Nov 17, 2017

jdemeyer commented Nov 17, 2017

jdemeyer commented Nov 17, 2017

jdemeyer commented Nov 17, 2017

embray commented Dec 15, 2017

sagetrac-git mannequin commented Dec 15, 2017

sagetrac-git mannequin commented Dec 15, 2017

jdemeyer commented Dec 15, 2017

fchapoton commented Dec 15, 2017

vbraun commented Dec 17, 2017

fchapoton commented Dec 20, 2017

embray commented Dec 21, 2017

sagetrac-git mannequin commented Dec 21, 2017

sagetrac-git mannequin commented Dec 21, 2017

embray commented Dec 21, 2017

jdemeyer commented Dec 21, 2017

jdemeyer commented Dec 25, 2017

jdemeyer commented Dec 25, 2017

vbraun commented Dec 26, 2017