erroneous use of charwidth in rpad? #10825

Closed
jiahao opened this issue Apr 15, 2015 · 13 comments
Labels
docs This change adds or pertains to documentation unicode Related to unicode characters and encodings

@jiahao
Member

jiahao commented Apr 15, 2015

julia> a = rpad("\u2003", 5)
"    "

julia> length(a) #4 in 0.4-dev, 5 in 0.3.7
4

julia> map(x->convert(Uint32, x), collect(a)) #0.4-dev has one fewer space
4-element Array{UInt32,1}:
 0x00002003
 0x00000020
 0x00000020
 0x00000020

help?> rpad
search: rpad repeated macroexpand isdirpath tryparse normpath repmat repeat replace workspace realpath redisplay AbstractSparseArray

Base.rpad(string, n, p)

   Make a string at least "n" characters long by padding on the
   right with copies of "p".

help?> charwidth
search: charwidth

Base.charwidth(c)

   Gives the number of columns needed to print a character.
@StefanKarpinski
Sponsor Member

I guess the issue here is whether rpad is meant for producing a specific number of characters or a specific number of columns. Both things are useful.

@stevengj
Member

@jiahao, what was the issue with width-2 characters in 4af443f? Conversely, what would you want to do with zero-width combining characters — should rpad("x̂", 2) return "x̂ " (the current behavior) or "x̂" (if it were producing a specific number of codepoints)?

Or a third option: maybe rpad should produce a specific number of graphemes? This would treat width-1 and width-2 graphemes identically, but would still give "x̂ " from rpad("x̂", 2).

@stevengj stevengj added the unicode Related to unicode characters and encodings label Apr 15, 2015
@jiahao
Member Author

jiahao commented Apr 15, 2015

I was using rpad to generate a ReST simple table, where the columns had to be the exact same length. Sphinx considered each doublewidth character as fitting in 1 column and complained that the text was no longer correctly aligned. I think it does boil down to what the second argument means - I don't think "character" is well defined anymore as it is currently used in the docs.

@stevengj
Member

@jiahao, what does Sphinx consider to be a "character"? Any codepoint? Nonzero-width codepoints? Graphemes?

@jiahao
Member Author

jiahao commented Apr 19, 2015

I think Sphinx uses Python's definition of 1 char = 1 code point.

>>> len("e\u0302") #Python 3.4.3
2

>>> len(u"e\u0302") #Python 2.7.6
2
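
To make the code point versus column distinction concrete, here is a minimal Python sketch comparing the two counts. The helper `approx_columns` is hypothetical (it is not part of Python, Sphinx, or Julia). It assumes combining marks take zero columns and East Asian Wide/Fullwidth characters take two, which is only a rough stand-in for what `charwidth` or a real terminal does.

```python
import unicodedata


def approx_columns(s: str) -> int:
    """Rough column count for s; an approximation, not a wcwidth replacement."""
    cols = 0
    for ch in s:
        if unicodedata.combining(ch):
            # Combining marks attach to the previous character: no extra column.
            continue
        # East Asian Wide ("W") and Fullwidth ("F") characters usually take two columns.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols


# Code points (what len() counts) versus approximate columns.
for label, text in [("x + combining circumflex", "x\u0302"),
                    ("two CJK ideographs", "\u65e5\u672c"),
                    ("em space", "\u2003")]:
    print(f"{label}: code points={len(text)}, approx columns={approx_columns(text)}")
```

Under these assumptions the combining sequence has two code points but one column, and the two ideographs have two code points but four columns, so the two padding rules diverge in opposite directions.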

@stevengj
Member

In this case, you need a rewrite of rpad based on length(s) for your use-case. But I'm skeptical that this should be the default.
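
A length-based pad of the kind suggested here could look like the following sketch. It is written in Python to match the code-point check above, and `rpad_codepoints` is a hypothetical name, not an existing function in Julia or Python. It counts code points only and ignores display width entirely.

```python
def rpad_codepoints(s: str, n: int, p: str = " ") -> str:
    """Pad s on the right with copies of p until it has at least n code points."""
    if not p:
        raise ValueError("pad string must be non-empty")
    missing = n - len(s)
    if missing <= 0:
        return s  # already long enough; never truncate
    # Repeat p enough times, then trim so the result has exactly n code points.
    reps = -(-missing // len(p))  # ceiling division
    return s + (p * reps)[:missing]


padded = rpad_codepoints("\u2003", 5)
print(len(padded))
```

For a single-character pad, Python's built-in `str.ljust(n, p)` gives the same result, since it also measures width with `len`.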

@jiahao
Member Author

jiahao commented Apr 20, 2015

Yes, unfortunately it looks like uses of rpad have to be changed based on whether the destination string is intended to be displayed to the user or to be fed into an external program. Which is too bad, because the inconsistent handling of spacing by downstream programs is guaranteed to cause more headaches.

@stevengj
Member

(Not to mention the inconsistent handling of strings by displays; cf. #3721.)

@elextr

elextr commented Apr 20, 2015

It really needs two rpads, one for code points and one for graphemes; as @StefanKarpinski said, both are useful.

Or a keyword parameter rpad("xxx", 5, to=:code_point). That only leaves which is default as a discussion of the hue of the bipedal transport garaging :)

@JeffBezanson
Sponsor Member

Padding based on columns feels like the most sensible behavior to me; the help text should be updated.

I wonder if lpad and rpad should be refactored. They have duplicated code, and it would get worse if we added options. These operations could be written something like s * padding(s, pad, columns=10) or padding(s, pad, codepoints=5) * s etc.
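
The shape of that refactoring can be sketched as follows, in Python for consistency with the earlier snippet in this thread. Everything here is hypothetical: `padding`, its `columns`/`codepoints` keywords, and the `_columns` width estimate are illustrations of the idea, not an existing or planned Julia API. The width estimate assumes combining marks are zero-width and East Asian Wide/Fullwidth characters are two columns.

```python
import unicodedata


def _columns(s: str) -> int:
    """Approximate display width (assumption: combining = 0, Wide/Fullwidth = 2)."""
    return sum(
        0 if unicodedata.combining(ch)
        else 2 if unicodedata.east_asian_width(ch) in ("W", "F")
        else 1
        for ch in s
    )


def padding(s: str, pad: str = " ", *, columns=None, codepoints=None) -> str:
    """Return only the padding needed to bring s up to the requested size."""
    if (columns is None) == (codepoints is None):
        raise ValueError("pass exactly one of columns= or codepoints=")
    if columns is not None:
        missing, unit = columns - _columns(s), _columns(pad)
    else:
        missing, unit = codepoints - len(s), len(pad)
    if unit <= 0:
        raise ValueError("pad must have positive size in the chosen unit")
    # Floor division: never overshoot the target, even with a multi-unit pad.
    return pad * max(0, missing // unit)


s = "\u65e5\u672c"                      # two double-width characters
right = s + padding(s, columns=6)       # rpad by columns
left = padding(s, codepoints=6) + s     # lpad by code points
```

Because the helper returns only the padding, the left and right variants share one implementation and differ only in which side of `s` the result is concatenated on.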

@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 14, 2016
@be5invis

be5invis commented Dec 2, 2016

Deciding the real space needed for layout (in a console or anywhere else) is extremely hard. For example, some CJK fonts may make α full-width.

@JeffBezanson JeffBezanson added the docs This change adds or pertains to documentation label Jan 5, 2017
@JeffBezanson
Sponsor Member

The help text for this has already been updated to say that padding is based on columns. There are other kinds of padding you might want, but the column behavior is useful too.

@tkelman
Contributor

tkelman commented Jan 5, 2017

x-ref f65befe


7 participants