Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py3: simplified string conversion utilities #24222

Closed
embray opened this issue Nov 16, 2017 · 106 comments
Closed

py3: simplified string conversion utilities #24222

embray opened this issue Nov 16, 2017 · 106 comments

Comments

@embray
Copy link
Contributor

embray commented Nov 16, 2017

A possible alternative to #24186, implementing simple conversion from C char arrays or bytes objects to str objects, and of str objects to bytes objects. Here "str" and "bytes" are to be read exactly for either Python 2 or Python 3, so on Python 2 this means no conversion is performed since str is bytes == True.

One thing this does not do is implement any kind of conversion from Python 2 unicode objects to bytes. This functionality might be worth adding, in some form, to str_to_bytes. But this would add a new feature on Python 2, whereas for now I'm only trying to preserve the existing functionality on Python 2 exactly, while transparently supporting Python 3 strs everywhere that Python 2 strs are supported.

CC: @fchapoton @jdemeyer

Component: python3

Author: Erik Bray, Jeroen Demeyer

Branch/Commit: dec9f3a

Reviewer: Jeroen Demeyer, Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/24222

@embray embray added this to the sage-8.1 milestone Nov 16, 2017
@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Nov 16, 2017

Branch pushed to git repo; I updated commit sha1. New commits:

4c78743Add new string conversion utilities.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Nov 16, 2017

Commit: 4c78743

@jdemeyer
Copy link

comment:3

OK, so now we have to discuss encodings... first of all, I really don't like the encoding argument in the functions, that will just slow everything down. Better pick one (or a few) good encoding(s) and use those.

I don't see why you would pick locale.getpreferredencoding() as default encoding. Relevant encodings should be:

  1. ascii. In many cases we know that output from a C library is an ASCII string (for example, the string representation of a number). Luckily, ASCII is compatible with most other encodings, so this isn't an issue.

  2. filesystemencoding. This is the encoding that Python internally uses for most conversions char* <-> str. For this reason, I think that this is the most important use case and it should be the default when in doubt.

  3. utf-8. This is good if you just want to store unicode as char* (say, an internal user-specified name for something) and you don't care about the exact encoding but you do care that round-trip conversion works properly.

  4. locale.getpreferredencoding(). I believe that this is mainly relevant for subprocesses, provided that those subprocesses also look at the encoding.

For efficiency, I would rather have several sets of functions, each with a hard-coded encoding.

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file. That way, they will actually be inlined when calling them from an external Cython module.

@jdemeyer
Copy link

comment:4

You don't need six.PY2. You can just replace

if six.PY2:
    from cpython.string cimport PyString_FromString
else:
    from cpython.bytes cimport PyBytes_FromString as PyString_FromString

by

from cpython.bytes cimport PyBytes_FromString as PyString_FromString

You could also just use PyBytes_FromString in the code.

@embray
Copy link
Contributor Author

embray commented Nov 16, 2017

comment:5

Replying to @jdemeyer:

You don't need six.PY2. You can just replace

That's what I thought originally too, but I was having some problems trying to do it as a one-liner. I'll try your suggestion though and double-check whether it works.

@embray
Copy link
Contributor Author

embray commented Nov 16, 2017

comment:6

I don't think the "encoding" argument slows anything down by any significant amount, especially in cases where it isn't used. I'd be amenable to encoding-specific functions for some cases, as Python has those as well in its API. But right now I'm really trying to avoid breadth of API surface. It is good to have generic versions of these functions that accept any encoding, and this is the bare minimum needed to get Python 3 support off the ground. I think that later we can encoding-specific functions where specific use cases for them can be demonstrated. I don't want to get into the weeds with this right now.

As for the default encoding, locale.getpreferredencoding() makes more sense. sys.getfilesystemencoding() is specifically about guessing (more or less) what encoding the filesystem specifically is using for encoding e.g. of filenames, and has nothing to do with encodings of file contents or of arbitrary strings output by software. (Well, that's not exactly true since on *nix the two are essentially the same, although the former can change if the process's locale is changed, I think).

So on *NIX platforms like we care about the most it's actually mostly a moot point. But a lot of the software Sage interfaces with is locale-aware, so it's better to take that into account as the default than not.

@embray
Copy link
Contributor Author

embray commented Nov 16, 2017

comment:7

I would also suggest to implement these functions (or at least the functions where you care about performance) as cdef inline functions in the .pxd file.

I was wondering about that. But in that case the .pyx file will be empty (does it even need to exist at all)? What about the default encoding variables?

@jdemeyer
Copy link

comment:8

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

@jdemeyer
Copy link

comment:9

Replying to @embray:

I don't think the "encoding" argument slows anything down by any significant amount

The Python 3 C API has specific functions to encode/decode for certainly particular encodings:

I haven't profiled, but I would hope that these are faster than the generic API.

@jdemeyer
Copy link

comment:10

Replying to @embray:

But in that case the .pyx file will be empty (does it even need to exist at all)?

It doesn't need to exist.

What about the default encoding variables?

I propose to remove those variables anyway. If you do need them for some reason, they must be in the .pyx file.

@jdemeyer
Copy link

comment:11

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like
if (0)
    nonexisting_function()

when nonexisting_function() doesn't actually exist like PyUnicode_AsUTF8 on Python 2. The only reason that the above code compiles is compiler optimization.

See also #24215 for Cython-compile-time constants in general.

  1. Could you move this to sage/cpython/string.pxd? I recently created the sage/cpython directory to deal with low-level Python internals and I think that these functions would fit there.

@fchapoton
Copy link
Contributor

comment:12

fails to build, see patchbot report:

from string import Template

@fchapoton
Copy link
Contributor

comment:13

oh, I see. Name conflict between the global "string" module and the local "string" module..

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:14

Replying to @jdemeyer:

Maybe the best solution would be to not have a default encoding and require users to think what encoding they want?

That's far too onerous and anxiety-provoking for the user :) Python itself uses default encodings all over the place when in doubt (hence e.g. PyUnicode_DecodeFSDefault). It's an unfortunate fact that there's isn't always one "right answer" here; the best we can do is provide sensible defaults and the ability for user-specified encoding where applicable; that is, where we know we want a specific encoding. Moving forward we can also do more, for example, to ensure that any locale-aware code run by Sage is handled well.

I could definitely agree to adding more encoding-specific helper functions, especially for ASCII and UTF-8. But as a first pass, for the sake of getting Python 3 support off the ground, I'd prefer to leave this as is and then make adjustments as specific use cases arise. It will be difficult to even find those specific use cases until and unless we get further along on getting Python 3 working in general (with these functions, plus a few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so that will help expose a lot of interesting cases quickly).

I'll look at the rest of your suggestions; they seem reasonable.

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:15

Replying to @jdemeyer:

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:16

Replying to @embray:

Replying to @jdemeyer:

Two more things:

  1. it is safer to use the Cython-compile-time constant PY_VERSION_HEX instead of the C-compile-time constant PY_MAJOR_VERSION. It is dubious C code to write something like

Ah, I was actually really looking for something like this but I couldn't find it anywhere in the Cython documentation. Am I just blind?

Nevermind; I see now that we explicitly pass that in to cythonize in the setup.py. For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

@fchapoton
Copy link
Contributor

comment:17

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:18

Replying to @fchapoton:

It will be difficult to even find those specific use cases until and unless we get >further along on getting Python 3 working in general (with these functions, plus a >few other fixes I'll be posting soon, I've gotten the Sage doctest runner working, so >that will help expose a lot of interesting cases quickly).

This sounds great ! I was hoping that the doctest framework could be made to work at some point, but was not expecting it soon.

I had to get it working, in part, so that I could run the doctests for this module :)

I think it will help things go much faster.

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:19

I'm seeing now how wanting to have module-level global variables in conjunction with inline c(p)def functions in Cython gets problematic. It just compiles them as __Pyx_GetModuleGlobalName(name). I feel like this should be a bug (I think not too difficult to fix?) in Cython, since it's really counter to how Python should work in this case.

For functions inlined from another module--at least if those functions access global variables from their original module--it should import that module during module initialization and use the correct module dict for globals lookups (just as normal Python functions do, basically).

That's an issue beyond this one though, so I'll rework things for now to get rid use of the global variables by these functions (I still want to have default encodings though).

@embray
Copy link
Contributor Author

embray commented Nov 17, 2017

comment:20

So it turns out PyUnicode_AsEncodedString and PyUnicode_Decode do allow the encoding argument to be NULL, in which case they default to UTF-8. So if it's a good enough default for Python maybe it's a good enough default for us, for now.

Perhaps I'll just stick with that functionality, and take care to use things like locale.getpreferredencoding() when necessary, such as interfacing with external software. To that end it would be good to come up with some list of exactly what software Sage interfaces with is locale aware. Certainly ECL and GAP are strong candidates; probably others. It will be good to add some tests to that effect but we can worry about it when we come to it.

@jdemeyer
Copy link

comment:21

Replying to @embray:

Python itself uses default encodings all over the place when in doubt

Right, but there are several defaults (each of UTF-8, sys.getfilesystemencoding() and locale.getpreferredencoding() could be considered as defaults). And I don't think that any of those 3 has the large majority. So I still feel that forcing the user to pick one is the best solution.

@jdemeyer
Copy link

comment:22

Also, I feel that error handling should be different for the different cases: if you are communicating with locale-aware software using locale.getpreferredencoding() you probably want decoding errors to be actual errors. With the file system encoding on the other hand, surrogateescape is typically a better default.

@jdemeyer
Copy link

comment:23

Replying to @embray:

For that matter, any reason not to add simply PY2 and PY3 compile-time constants? Comparing against PY_VERSION_HEX is a little annoying.

No problem for me, although I would prefer PY_MAJOR_VERSION then. If you want to do that, please make a new ticket and make it depend on #24215 please.

@embray
Copy link
Contributor Author

embray commented Dec 15, 2017

comment:84

Fine but it really doesn't make much difference.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Dec 15, 2017

Changed commit from 2bb8396 to 6d73a41

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Dec 15, 2017

Branch pushed to git repo; I updated commit sha1. New commits:

6d73a41Use more specific type declarations here.

@jdemeyer
Copy link

comment:87

Good for me if it passes testing.

@fchapoton
Copy link
Contributor

comment:88

one green bot. I am setting to positive this important ticket.

@vbraun
Copy link
Member

vbraun commented Dec 17, 2017

comment:89

Merge conflict

@fchapoton
Copy link
Contributor

comment:90

Volker, do you by chance have any idea about what was the conflicting ticket ?

@embray
Copy link
Contributor Author

embray commented Dec 21, 2017

comment:91

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Dec 21, 2017

Changed commit from 6d73a41 to dec9f3a

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Dec 21, 2017

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

a285af4Add new string conversion utilities.
5b3719aAdd support for optional error handling.
bf5f893Make type checks explicit
1bd27fcRequire a const char*--this is more consistent for use in cases where
bd2d931Use a less unusual character in the tests
b32f000Use r-strings for the docstrings since they contain slashes
ccf7bcdClean up these declarations:
dec9f3aUse more specific type declarations here.

@embray
Copy link
Contributor Author

embray commented Dec 21, 2017

comment:93

Rebased on current develop branch if it helps, but there was no merge conflict.


New commits:

a285af4Add new string conversion utilities.
5b3719aAdd support for optional error handling.
bf5f893Make type checks explicit
1bd27fcRequire a const char*--this is more consistent for use in cases where
bd2d931Use a less unusual character in the tests
b32f000Use r-strings for the docstrings since they contain slashes
ccf7bcdClean up these declarations:
dec9f3aUse more specific type declarations here.

@jdemeyer
Copy link

comment:94

Replying to @embray:

Merge conflict with what? If there were a normal "master" branch into which tickets were merged regularly against which I could compare that would be one thing, but you can't just secretly merge a bunch of tickets all at once, claim "merge conflict", and expect me to guess what it conflicts with.

It's not secret: https://github.com/vbraun/sage/tree/develop

@jdemeyer
Copy link

Changed dependencies from #24246 to none

@jdemeyer
Copy link

comment:95

I don't see any conflict...

@vbraun
Copy link
Member

vbraun commented Dec 26, 2017

Changed branch from u/embray/python3/string-conversions to dec9f3a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants