Unicode : change df.to_string() and friends to always return unicode objects #2224

ghost · 2012-11-11T18:25:14Z

Note: Although all the tests pass with minor fixes, this PR has an above-average chance of
breaking things for people who have relied on broken behaviour thus far.

df.tidy_repr combines several strings to produce a result. When one component is unicode
and another is a non-ascii bytestring, it tries to convert the latter to a unicode string
using the 'ascii' codec and fails.
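
A minimal sketch of that failure mode, written for Python 3, where the implicit coercion no longer exists and so has to be emulated explicitly. The helper name py2_style_concat is hypothetical and is not part of pandas:

```python
def py2_style_concat(text, raw):
    """Emulate Python 2 joining a unicode object with a bytestring."""
    # Python 2 implicitly decoded the bytestring with the 'ascii' codec here.
    return text + raw.decode('ascii')


# A utf-8 encoded bytestring containing a non-ascii character.
utf8_bytes = u'caf\xe9'.encode('utf-8')

try:
    py2_style_concat(u'name: ', utf8_bytes)
    failed = False
except UnicodeDecodeError:
    # This is the error the mixed unicode/bytestring case runs into.
    failed = True

# Pure-ascii bytestrings go through, which is why the bug can stay hidden.
ascii_result = py2_style_concat(u'name: ', b'cafe')
```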

I suggest that _get_repr -> to_string should always return unicode, as implemented by this PR,
and that the force_unicode argument be deprecated everywhere.

The force_unicode argument in to_string conflates two things:

- which codec to use to decode the string (which can only be a hopeful guess),
- whether to return a unicode() object or a str() object.

The first is no longer necessary, since pprint_thing already resorts to the same hack
of using utf-8 (with errors='replace') as a fallback.
I believe making the second optional is wrong, precisely because it brings about situations
like the test case above.
to_string, like all internal functions, should use unicode objects whenever feasible.
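
The utf-8 fallback described above can be sketched roughly as follows. This is an illustration for Python 3 only, not the actual pprint_thing implementation; the function name to_text and its fallback_encoding parameter are assumptions made for the example:

```python
def to_text(thing, fallback_encoding='utf-8'):
    """Return a text (unicode) rendering of `thing` without raising."""
    if isinstance(thing, bytes):
        # The encoding is only a hopeful guess, so undecodable bytes are
        # replaced with U+FFFD instead of raising UnicodeDecodeError.
        return thing.decode(fallback_encoding, 'replace')
    return str(thing)


parts = [
    to_text(u'caf\xe9'),                  # already text
    to_text(u'caf\xe9'.encode('utf-8')),  # utf-8 bytestring, decoded silently
    to_text(b'caf\xe9'),                  # latin-1 bytestring, not valid utf-8
    to_text(42),                          # non-string object
]

# Every part is text now, so joining them cannot trigger an implicit decode.
joined = u' | '.join(parts)
```

Because every component is converted to text before the pieces are combined, the caller no longer has to choose between a unicode and a str result.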

wesm · 2012-11-11T18:33:11Z

This seems pretty reasonable. Should I take a chance merging this for 0.9.1? I've encountered the bug you fixed here before.

ghost · 2012-11-11T18:39:09Z

I would at least wait a few days before merging this (perhaps @jseabold or someone else would like
to argue their use case).

wesm · 2012-11-11T18:43:47Z

I guess the question is what code will break because the string is coming back as unicode. Obviously if you had df.to_string(force_unicode=True).decode('utf-8') that is going to break. Maybe this should be held off until the 0.10 series.
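
For calling code that has to cope with both the old and the new return type, one possible defensive pattern is to decode only when the value is still a bytestring. The helper below is a hypothetical sketch, not something pandas provides:

```python
def ensure_text(value, encoding='utf-8'):
    """Decode `value` only if it is still a bytestring."""
    if isinstance(value, bytes):
        return value.decode(encoding, 'replace')
    # Already text: calling .decode() here is exactly what would break.
    return value


from_bytes = ensure_text(u'caf\xe9'.encode('utf-8'))
from_text = ensure_text(u'caf\xe9')
```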

ghost · 2012-11-11T18:47:58Z

It depends on whether you consider this a bug fix or a breaking change. I'm fine with 0.10 though.

changhiskhan · 2012-11-12T16:51:04Z

Let's wait 'til 0.10. Let's merge it into master as soon as the release is out though.

wesm · 2012-11-12T17:03:11Z

Agreed...

aldanor · 2012-11-13T15:04:17Z

This would be great. As of right now, you have to do something dirty (at least that's the only way I found it works) like DataFrame(series).to_string(force_unicode=True, header=False) to correctly print a Series object with unicode characters to a utf-8 console.

ghost · 2012-11-14T01:51:26Z

I took this a step further, realizing that the unicode issue really matters only
when we want to get a string representation of an object.

So:

- I converted more related functions to work exclusively with unicode.
  Since everything should taper down to pprint_thing at the bottom, any utf-8 bytestrings
  should get silently decoded into unicode.
- If your data is not unicode and not utf-8, it's unreasonable to expect str(df) to do
  the right thing, so you'll get � (the unicode replacement character), but no exceptions
  (hopefully).
- Fixing a couple of corner cases along the way, I added all the boilerplate so that
  str(x)/unicode(x)/bytes(x) work on py2 and py3 for series/df/panel.

Yell if something broke.
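
For readers wondering what the str/unicode/bytes boilerplate might look like, here is a rough sketch. The class names TextReprMixin and Label are hypothetical, the utf-8 choice in __bytes__ is an assumption, and the code in the PR may be organised differently:

```python
import sys

PY3 = sys.version_info[0] >= 3


class TextReprMixin(object):
    """Derive str() and bytes() from a single __unicode__ method."""

    def __unicode__(self):
        raise NotImplementedError

    def __bytes__(self):
        # Assumption for this sketch: bytes are always utf-8 encoded.
        return self.__unicode__().encode('utf-8', 'replace')

    def __str__(self):
        # str is text on py3 and a bytestring on py2.
        if PY3:
            return self.__unicode__()
        return self.__bytes__()


class Label(TextReprMixin):
    def __init__(self, value):
        self.value = value

    def __unicode__(self):
        return u'Label(%s)' % self.value


label = Label(u'caf\xe9')
as_text = label.__unicode__()
as_bytes = label.__bytes__()
```

The subclass only has to produce unicode in one place; the encoding step happens once, at the boundary, which matches the "keep everything unicode at the bottom levels" approach described in this thread.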

wesm · 2012-11-21T04:58:50Z

@aldanor I see you deleted your comment but I checked that your example works now, at least on my environment...

aldanor · 2012-11-21T05:08:27Z

@wesm Thanks, sounds good. I just didn't want to confuse everyone cause I wasn't sure this wasn't something specific to my environment. I will try and test it again soon as I can.

DOC: add note about formatters needing to return unicode (if returning strings)
We need to keep everything unicode at the bottom levels, so that we can combine strings with other unicode strings at the I/O choke-points; otherwise Python tries to coerce the bytestring into unicode using the 'ascii' encoding, and we get UnicodeDecodeError.

…ries,df,panel
- If you put in proper unicode data, you're good.
- If you put in utf-8 bytestrings you should still be good (it works if rendering is wrapped by pprint_thing, I may have missed a few spots).
- If you put in non utf-8 bytestrings, with the encoding unknown, and expect unicode(x) or str(x) to do the right thing - you're doing it wrong.

ghost · 2012-11-22T18:58:59Z

Added str/unicode/bytes support for Index,MultiIndex.

ghost · 2012-11-27T03:02:27Z

takeback

y-p added 9 commits November 22, 2012 20:48

TST: series tidy_repr with unicode data values

ac0898f

ENH: Series tidy_repr should use pprint_thing and console_encode #2225

007622d

ENH: to_string() and to_str_columns() should return unicode, deprecat…

2599741

…e force_unicode #2225 using pprint_thing will try to decode using utf-8 as a fallback, but by these functions will now return unicode() rather than str() objects.

ENH: convert more internal string processing to unicode, SeriesFormat…

a34ac81

…ter, Index.format, etc'

TST: str(x)/unicode(x),bytes(x)/str(x) should always work if x is a d…

c22da50

…f/series containing unicode

TST: str(x)/unicode(x),bytes(x)/str(x) should always work for Index,M…

f0deaa6

…ultiIndex

ENH: py2/py3 support for str(x)/unicode(x) and bytes(x)/str(x) for In…

436bf36

…dex,MultiIndex

wesm added a commit that referenced this pull request Nov 27, 2012

Merge y-p/unicode__ #2224

a240b29

wesm merged commit 436bf36 into pandas-dev:master Nov 27, 2012

This was referenced Apr 3, 2014

DEPR: create issues for the current FutureWarnings in pandas #6641

Closed

Remove number of deprecated parameters/functions/classes [fix #6641] #6813

Merged
