py3 : introducing string conversion tools #24186

fchapoton · 2017-11-10T12:43:26Z

CC: @embray @jdemeyer @tscrim @kiwifb

Component: python3

Keywords: unicode

Branch/Commit: u/chapoton/24186 @ 0e6a9e5

Reviewer: Frédéric Chapoton

Issue created by migration from https://trac.sagemath.org/ticket/24186

The text was updated successfully, but these errors were encountered:

fchapoton · 2017-11-10T12:44:42Z

Commit: 0e6a9e5

fchapoton · 2017-11-10T12:44:42Z

Changed keywords from none to unicode

fchapoton · 2017-11-10T12:44:42Z

New commits:

`0e6a9e5`	`py3: introducing string conversion tools`

fchapoton · 2017-11-10T12:44:42Z

Branch: u/chapoton/24186

embray · 2017-11-10T14:16:30Z

comment:2

This is definitely the right direction. I'm not 100% sure how I feel about string_to_str. Usually it's not desirable on Python 2, if you already have a unicode object, to convert it to str. This function has very orthogonal behavior between Python 2 and Python 3 because on Python 2 it takes a unicode string and returns a bytes string, whereas on Python 3 it takes a unicode string and returns the same unicode string--very confusing and not clear where that behavior is useful.

Rather than "string_to_..." I think it's more useful to have a set of four functions:

bytes_to_str, str_to_bytes, unicode_to_str, str_to_unicode.

On Python 2 bytes_to_str and str_to_bytes are no-ops because bytes is str. Keeping cases that were already str on Python 2 is usually preferable than converting to unicode everywhere. This is more backwards compatible. You don't want to start converting str to unicode everywhere that wasn't unicode before if you don't have to.

Similarly on Python 3 unicode_to_str and str_to_unicode are no-ops. This pair is also not usually as useful but it can be sometimes I suppose.

embray · 2017-11-10T14:20:47Z

comment:3

More like I described here: #24059 comment:3

embray · 2017-11-10T14:21:52Z

comment:4

That said these functions might be useful too, though I'm curious to see some examples of what you have in mind as to where they'd be useful.

embray · 2017-11-10T14:31:49Z

comment:5

If it helps clarify--the reason for my logic here is that the areas where it is most common to want to convert between bytes/unicode is at the boundary of system functions. On Python 2 the tradition here is "str in str out". This can cause problems of course, which is why the unicode model was changed so drastically on Python 3. But the fact remains, unless there was a specific reason to convert from str to unicode at the "out" end (for example, we know we're receiving non-ASCII encoded text), usually "str in str out" is the default on Python 2.

However on Python 3 things are very different--we still want, in most cases "str in str out". This is because users and developers are simply more used to dealing with str (that is, the object created with "" or '' literals without any prefixes). But on Python 3 this means we have to explicitly handle the "unicode" sandwich "str -> bytes -> -> bytes -> str".

If we want to write the same code on Python 2 and Python 3 we can write to_str(system_api(to_bytes(my_str))). Where to_str and to_bytes are defined as in my comment linked above. This properly preserves the "str in str out" idiom on both Python 2 and 3.

Of course there are exceptions to this case but they are less common.

jdemeyer · 2017-11-10T14:34:48Z

comment:6

And you really want to implement these in Cython, not plain Python.

embray · 2017-11-10T14:38:58Z

comment:7

Indeed--for example for the Python to str_to_bytes() this is a no-op. It just immediately returns its input. In C this could just be a macro but unfortunately Cython doesn't make it possible to define things as macros. That said, if this is a cpdef function, between Cython and the C compiler it should be able to optimize it away entirely. I'm going to have a look if that's actually the case though...

jdemeyer · 2017-11-10T14:53:17Z

comment:8

Replying to @embray:

between Cython and the C compiler it should be able to optimize it away entirely.

...if it's a cdef inline function.

Something like

cdef extern from *:
    int PY_MAJOR_VERSION

cdef inline bytes_to_str(x):
    if PY_MAJOR_VERSION <= 2:
        return <str>x
    return x.decode(encoding)

Ideally, that would be optimized away entirely in Python 2. This is not entirely true, because Cython still changes some refcounts. But the effect of that should be negligible.

embray · 2017-11-10T15:09:26Z

comment:9

Indeed since PY_MAJOR_VERSION is a constant that branch gets optimized away completely.

You could make it a cpdef so that the function can be used from pure Python as well.

fchapoton · 2017-11-10T16:27:46Z

comment:10

Are you going to propose a branch ?

embray · 2017-11-10T17:08:32Z

comment:11

Sure if you don't mind.

fchapoton · 2017-11-12T09:01:23Z

comment:12

Of course I don't mind. I am asking for that since long.

jdemeyer · 2017-11-13T09:18:46Z

comment:13

Erik, will you do that or should I?

One suggestion: I suggest to implement conversion char* -> str and then do conversion bytes -> str using that. This is because in many cases bytes in Cython code comes from char* in C and the direct conversion char* -> str will be faster than char* -> bytes -> str. And char* -> str is easy using PyString_FromString (Python 2) or PyUnicode_DecodeFSDefault (Python 3).

embray · 2017-11-13T12:40:51Z

comment:15

Do you mean as separate functions (I'm not sure how else one would do that)?

jdemeyer · 2017-11-13T12:46:27Z

comment:16

Replying to @embray:

Do you mean as separate functions (I'm not sure how else one would do that)?

I was thinking something like

cdef str array_to_str(char* x):
    ...

cpdef str bytes_to_str(b):
    return array_to_str(PyBytes_AsString(b))

embray · 2017-11-15T08:05:43Z

comment:17

Okay, I have an implementation of this I'll post to a new ticket in a bit. I just want to try it out a bit in practice more.

slel · 2017-11-21T20:02:21Z

comment:18

Erik's implementation of bytes_to_str and str_to_bytes is at #24222 and needs review.

jdemeyer · 2017-11-28T09:37:43Z

comment:19

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

embray · 2017-11-28T13:40:36Z

comment:20

Replying to @jdemeyer:

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

I mean in fairness this came before #24222, which was my offer of an alternative. I think we can do without this for now. One thing I do like about it is that it supports conversion of unicode objects on Python 2 to "bytes". #24222 eschews any attempt at general unicode support on Python 2, but it might be worth adding support for passing unicode through my str_to_bytes() function at some point.

Other than that though I don't think this ticket itself is needed.

fchapoton · 2017-11-30T16:12:51Z

comment:21

then let us close this one as duplicate

jdemeyer · 2017-11-30T16:14:59Z

Reviewer: Frédéric Chapoton

jdemeyer · 2017-11-30T16:14:59Z

Changed author from Frédéric Chapoton to none

fchapoton added this to the sage-8.1 milestone Nov 10, 2017

fchapoton added c: python3 labels Nov 10, 2017

fchapoton added the s: needs review label Nov 10, 2017

jdemeyer added s: needs work and removed s: needs review labels Nov 13, 2017

fchapoton removed this from the sage-8.1 milestone Nov 30, 2017

fchapoton added s: needs review and removed s: needs work labels Nov 30, 2017

jdemeyer removed the s: needs review label Nov 30, 2017

jdemeyer closed this as completed Nov 30, 2017

jdemeyer added r: duplicate and removed p: major / 3 labels Nov 30, 2017

embray mentioned this issue Dec 26, 2017

py3: simplified string conversion utilities #24222

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

py3 : introducing string conversion tools #24186

py3 : introducing string conversion tools #24186

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

jdemeyer commented Nov 10, 2017

embray commented Nov 10, 2017

jdemeyer commented Nov 10, 2017

embray commented Nov 10, 2017

fchapoton commented Nov 10, 2017

embray commented Nov 10, 2017

fchapoton commented Nov 12, 2017

jdemeyer commented Nov 13, 2017

embray commented Nov 13, 2017

jdemeyer commented Nov 13, 2017

embray commented Nov 15, 2017

slel commented Nov 21, 2017

jdemeyer commented Nov 28, 2017

embray commented Nov 28, 2017

fchapoton commented Nov 30, 2017

jdemeyer commented Nov 30, 2017

jdemeyer commented Nov 30, 2017

py3 : introducing string conversion tools #24186

py3 : introducing string conversion tools #24186

Comments

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

fchapoton commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

embray commented Nov 10, 2017

jdemeyer commented Nov 10, 2017

embray commented Nov 10, 2017

jdemeyer commented Nov 10, 2017

embray commented Nov 10, 2017

fchapoton commented Nov 10, 2017

embray commented Nov 10, 2017

fchapoton commented Nov 12, 2017

jdemeyer commented Nov 13, 2017

embray commented Nov 13, 2017

jdemeyer commented Nov 13, 2017

embray commented Nov 15, 2017

slel commented Nov 21, 2017

jdemeyer commented Nov 28, 2017

embray commented Nov 28, 2017

fchapoton commented Nov 30, 2017

jdemeyer commented Nov 30, 2017

jdemeyer commented Nov 30, 2017