Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py3 : introducing string conversion tools #24186

Closed
fchapoton opened this issue Nov 10, 2017 · 25 comments
Closed

py3 : introducing string conversion tools #24186

fchapoton opened this issue Nov 10, 2017 · 25 comments

Comments

@fchapoton
Copy link
Contributor

CC: @embray @jdemeyer @tscrim @kiwifb

Component: python3

Keywords: unicode

Branch/Commit: u/chapoton/24186 @ 0e6a9e5

Reviewer: Frédéric Chapoton

Issue created by migration from https://trac.sagemath.org/ticket/24186

@fchapoton fchapoton added this to the sage-8.1 milestone Nov 10, 2017
@fchapoton
Copy link
Contributor Author

Commit: 0e6a9e5

@fchapoton
Copy link
Contributor Author

Changed keywords from none to unicode

@fchapoton
Copy link
Contributor Author

New commits:

0e6a9e5py3: introducing string conversion tools

@fchapoton
Copy link
Contributor Author

Branch: u/chapoton/24186

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:2

This is definitely the right direction. I'm not 100% sure how I feel about string_to_str. Usually it's not desirable on Python 2, if you already have a unicode object, to convert it to str. This function has very orthogonal behavior between Python 2 and Python 3 because on Python 2 it takes a unicode string and returns a bytes string, whereas on Python 3 it takes a unicode string and returns the same unicode string--very confusing and not clear where that behavior is useful.

Rather than "string_to_..." I think it's more useful to have a set of four functions:

bytes_to_str, str_to_bytes, unicode_to_str, str_to_unicode.

On Python 2 bytes_to_str and str_to_bytes are no-ops because bytes is str. Keeping cases that were already str on Python 2 is usually preferable than converting to unicode everywhere. This is more backwards compatible. You don't want to start converting str to unicode everywhere that wasn't unicode before if you don't have to.

Similarly on Python 3 unicode_to_str and str_to_unicode are no-ops. This pair is also not usually as useful but it can be sometimes I suppose.

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:3

More like I described here: #24059 comment:3

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:4

That said these functions might be useful too, though I'm curious to see some examples of what you have in mind as to where they'd be useful.

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:5

If it helps clarify--the reason for my logic here is that the areas where it is most common to want to convert between bytes/unicode is at the boundary of system functions. On Python 2 the tradition here is "str in str out". This can cause problems of course, which is why the unicode model was changed so drastically on Python 3. But the fact remains, unless there was a specific reason to convert from str to unicode at the "out" end (for example, we know we're receiving non-ASCII encoded text), usually "str in str out" is the default on Python 2.

However on Python 3 things are very different--we still want, in most cases "str in str out". This is because users and developers are simply more used to dealing with str (that is, the object created with "" or '' literals without any prefixes). But on Python 3 this means we have to explicitly handle the "unicode" sandwich "str -> bytes -> -> bytes -> str".

If we want to write the same code on Python 2 and Python 3 we can write to_str(system_api(to_bytes(my_str))). Where to_str and to_bytes are defined as in my comment linked above. This properly preserves the "str in str out" idiom on both Python 2 and 3.

Of course there are exceptions to this case but they are less common.

@jdemeyer
Copy link

comment:6

And you really want to implement these in Cython, not plain Python.

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:7

Indeed--for example for the Python to str_to_bytes() this is a no-op. It just immediately returns its input. In C this could just be a macro but unfortunately Cython doesn't make it possible to define things as macros. That said, if this is a cpdef function, between Cython and the C compiler it should be able to optimize it away entirely. I'm going to have a look if that's actually the case though...

@jdemeyer
Copy link

comment:8

Replying to @embray:

between Cython and the C compiler it should be able to optimize it away entirely.

...if it's a cdef inline function.

Something like

cdef extern from *:
    int PY_MAJOR_VERSION

cdef inline bytes_to_str(x):
    if PY_MAJOR_VERSION <= 2:
        return <str>x
    return x.decode(encoding)

Ideally, that would be optimized away entirely in Python 2. This is not entirely true, because Cython still changes some refcounts. But the effect of that should be negligible.

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:9

Indeed since PY_MAJOR_VERSION is a constant that branch gets optimized away completely.

You could make it a cpdef so that the function can be used from pure Python as well.

@fchapoton
Copy link
Contributor Author

comment:10

Are you going to propose a branch ?

@embray
Copy link
Contributor

embray commented Nov 10, 2017

comment:11

Sure if you don't mind.

@fchapoton
Copy link
Contributor Author

comment:12

Of course I don't mind. I am asking for that since long.

@jdemeyer
Copy link

comment:13

Erik, will you do that or should I?

One suggestion: I suggest to implement conversion char* -> str and then do conversion bytes -> str using that. This is because in many cases bytes in Cython code comes from char* in C and the direct conversion char* -> str will be faster than char* -> bytes -> str. And char* -> str is easy using PyString_FromString (Python 2) or PyUnicode_DecodeFSDefault (Python 3).

@embray
Copy link
Contributor

embray commented Nov 13, 2017

comment:15

Do you mean as separate functions (I'm not sure how else one would do that)?

@jdemeyer
Copy link

comment:16

Replying to @embray:

Do you mean as separate functions (I'm not sure how else one would do that)?

I was thinking something like

cdef str array_to_str(char* x):
    ...

cpdef str bytes_to_str(b):
    return array_to_str(PyBytes_AsString(b))

@embray
Copy link
Contributor

embray commented Nov 15, 2017

comment:17

Okay, I have an implementation of this I'll post to a new ticket in a bit. I just want to try it out a bit in practice more.

@slel
Copy link
Member

slel commented Nov 21, 2017

comment:18

Erik's implementation of bytes_to_str and str_to_bytes is at #24222 and needs review.

@jdemeyer
Copy link

comment:19

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

@embray
Copy link
Contributor

embray commented Nov 28, 2017

comment:20

Replying to @jdemeyer:

What is the point of this ticket in the light of #24222? Can we consider this a duplicate?

I mean in fairness this came before #24222, which was my offer of an alternative. I think we can do without this for now. One thing I do like about it is that it supports conversion of unicode objects on Python 2 to "bytes". #24222 eschews any attempt at general unicode support on Python 2, but it might be worth adding support for passing unicode through my str_to_bytes() function at some point.

Other than that though I don't think this ticket itself is needed.

@fchapoton
Copy link
Contributor Author

comment:21

then let us close this one as duplicate

@fchapoton fchapoton removed this from the sage-8.1 milestone Nov 30, 2017
@jdemeyer
Copy link

Reviewer: Frédéric Chapoton

@jdemeyer
Copy link

Changed author from Frédéric Chapoton to none

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants