Julia doesn't like Pizza #3721

Closed
Keno opened this issue Jul 15, 2013 · 61 comments
Labels
😃🍕 and other emoji bug Indicates an unexpected problem or unintended behavior unicode Related to unicode characters and encodings upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@Keno
Member

Keno commented Jul 15, 2013

julia> x = '\U1f355'
'\U1f355'

julia> charwidth(x)
0
@staticfloat
Member

I nominate this for most interesting issue subject.

In other news, this looks like something to take up with wcwidth; it doesn't even think pizza is printable:

$ cat test.c
#include <wchar.h>
#include <stdio.h>

int main( void ) {
    wchar_t pizza = 0x1f355; /* U+1F355 SLICE OF PIZZA */
    printf("%d\n", wcwidth(pizza) );
    return 0;
}

$  gcc -o test test.c && ./test
-1

We don't allow negative lengths, so we clip to 0, which seems pretty reasonable to me.
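
The clipping described above could look roughly like the following C sketch. This is an illustration only, not Julia's actual source; the helper name `clipped_width` is hypothetical.

```c
#define _XOPEN_SOURCE 700  /* exposes wcwidth() in <wchar.h> on glibc */
#include <wchar.h>

/* Hypothetical helper: wcwidth() returns -1 for characters it considers
 * non-printable (or does not know), so map any negative result to 0. */
int clipped_width(wchar_t c) {
    int w = wcwidth(c);
    return w < 0 ? 0 : w;
}
```

With a libc whose tables do not know U+1F355, `clipped_width(0x1f355)` would come out as 0, which matches the `charwidth(x)` result in the original report.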

@StefanKarpinski
Member

This prints just fine on my terminal and looks to have a width of two. Not sure how we should handle this given that the C function is wrong.

@jiahao
Member

jiahao commented Nov 5, 2013

Revisiting this, it appears to work on OSX Mavericks 10.9, although it reports a charwidth() of 1, so the ending single quote that wraps the character is overlaid on some rounds of pepperoni.

image

@JeffBezanson
Member

Unsurprisingly, Apple seems to be the first to update their Unicode tables. I think our only options here are to rely on whatever libc is available, or use our own Unicode tables. Not sure we want to get into all that.

@jiahao
Member

jiahao commented Nov 6, 2013

Is it possible to test some characters during build time and emit runtime warnings if charwidths are known to be wrong?
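
A build-time probe of the kind asked about here might be sketched as below. Everything in it is an assumption for illustration: the names `width_probe`, `count_width_mismatches`, `ascii_only_width` and `sample_probes` are hypothetical, and the expected width of 2 for U+1F355 is the value later reported in this thread for utf8proc.

```c
#include <stddef.h>
#include <stdint.h>

/* One known-good (code point, width) pair to probe the platform with. */
struct width_probe {
    uint32_t codepoint;
    int expected_width;
};

/* Count the probes for which width_fn disagrees with the expected width.
 * width_fn is whatever width lookup the build wants to vet. */
size_t count_width_mismatches(int (*width_fn)(uint32_t),
                              const struct width_probe *probes, size_t n) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (width_fn(probes[i].codepoint) != probes[i].expected_width)
            bad++;
    }
    return bad;
}

/* Stand-in for an outdated width table: only ASCII gets a width. */
int ascii_only_width(uint32_t cp) {
    return cp < 0x80 ? 1 : 0;
}

/* Sample probes: 'A' is one column, pizza is assumed to be two. */
const struct width_probe sample_probes[] = {
    {0x41, 1},
    {0x1F355, 2},
};
```

A build script could call `count_width_mismatches` with a wrapper around the system `wcwidth` and print a warning when the count is nonzero.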

@StefanKarpinski
Member

Eh, what's the point? That's just a warning that people are going to ignore or re-open this issue. If the system libc has the wrong character width for something, then you get mangled garbage. Get a better OS.

@timholy
Member

timholy commented Nov 6, 2013

The fact that this even exists as a character is a clear sign that Unicode allows too many bits :-).

@quinnj
Member

quinnj commented Jun 23, 2014

What's the status here? @Keno?

@JeffBezanson
Member

Looks like somebody submitted a patch to glibc yesterday to update their unicode data:
https://sourceware.org/ml/libc-alpha/2014-06/msg00585.html

@quinnj
Member

quinnj commented Aug 21, 2014

Did the bump from libutf8proc to libmojibake solve this? #7917.

@jakebolewski
Member

I don't think the char width problem has been solved yet. @jiahao went through all the codepoints and computed the correct widths, but this information has not made it into libmojibake.

@quinnj
Member

quinnj commented Aug 29, 2014

ping @jiahao

@jiahao
Member

jiahao commented Aug 29, 2014

The last time I discussed this with @stevengj, we weren't entirely settled on whether the new charwidth function should be submitted as an entirely new function to JuliaStrings/utf8proc#2 or as a patch to Julia's existing charwidth.

A correct charwidth would be useful to projects other than Julia, but it would add significant new functionality to libmojibake, which would then cease to be just a lightly updated fork of a minimal Unicode handling library.

@JeffBezanson
Member

I really think it makes sense for it to be in libmojibake. It's in line with the other functionality in there, and won't bloat the library by a large percentage.

@stevengj
Member

I don't actually care much either way, but I'm fine with putting it in libmojibake.

@stevengj
Member

The advantage of putting it in Julia (replacing src/support/wcwidth.c) is that it will still work if someone is using the system utf8proc (which seems not unlikely on e.g. Fedora, especially if the Unicode-7 support in libmojibake gets folded upstream).

@JeffBezanson
Member

Ok, then let's drop it in as our wcwidth.c to fix this issue, and possibly move it to libmojibake later.

@stevengj
Member

utf8proc now includes an up-to-date utf8proc_charwidth function based on @jiahao's analysis (JuliaStrings/utf8proc#27), so we can fix this issue by upgrading to the latest utf8proc and using this function instead of wcwidth. (utf8proc's charwidth for U+1f355 is 2.)

@tkelman
Contributor

tkelman commented Mar 12, 2015

We should probably turn the name back to utf8proc here. I'd also slightly prefer going back to using a tarball for it (once 1.2.0 is tagged) rather than a submodule.

@nalimilan
Member

+1

@stevengj
Member

I slightly prefer submodules, since the tarballs tend to leave a bunch of old versions littering the deps/ directory when the version is upgraded. Submodules are also somewhat more flexible, since we can link a pre-release version if there is an urgent need (e.g. a bugfix).

@tkelman
Contributor

tkelman commented Mar 14, 2015

Submodules tend to confuse newcomers when versions are upgraded, introducing confusing diffs after they git pull when we change the submodule. It's also a bit messier for packagers who want to use system versions. It should be possible to set UTF8PROC_VERSION to a non-release sha for testing, and github should just make the right tarball for us. Either way though.

@gnachman

gnachman commented Apr 1, 2015

@stevengj iTerm2 does need to interoperate with anything you can ssh, telnet, etc. to, not to mention Julia, so I'm open to giving users a way to opt in to a more sensible wcwidth(). I don't use wcwidth() on the client so I could use utf8proc_charwidth in the right circumstances. Since AFAIK only Julia departs from the standard, there'd need to be a new escape sequence to tell the terminal emulator to switch character-width lookup tables.

OTOH, since Julia is the black sheep in this regard, it probably makes the most sense for Julia to print a space after characters that it treats as wide but wcwidth does not. And deal with cursor movement across them correctly, etc. That'll work with every terminal out there. If a window gets resized it won't wrap correctly, though. Terminal and iTerm2 will both refuse to "break" a fullwidth character into two half-width pieces, choosing instead to move the whole thing to the start of the next line, but that's a small price to pay.
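
The space-padding workaround suggested above comes down to a small calculation, sketched here in C. The function name `padding_after` is hypothetical, and treating a negative terminal width as 0 is an assumption made for this sketch.

```c
/* Number of filler spaces to print after a character so that the cursor
 * ends up where the application expects it.
 *   app_width:  columns the application assumes for the character
 *   term_width: columns the terminal's wcwidth() reports (may be -1) */
int padding_after(int app_width, int term_width) {
    if (term_width < 0)
        term_width = 0;  /* assumption: unknown characters advance 0 columns */
    int pad = app_width - term_width;
    return pad > 0 ? pad : 0;
}
```

For a character the application treats as 2 columns wide and the terminal treats as 1, `padding_after(2, 1)` gives one space; when both agree, no padding is emitted.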

@Keno
Member Author

Keno commented Apr 1, 2015

@gnachman If I print a space next to a character, is there a chance the space will get drawn on top of it during a redraw? I think I've seen that behavior in my experiments.

@gnachman

gnachman commented Apr 1, 2015

@Keno Yes, that can happen. I'm working on a fix to that issue in my refactor_drawing branch. Feel free to try it if you're feeling brave :). I expect to merge it into master in a week or two. Terminal.app doesn't have that issue, so that approach is safe to use.

@JeffBezanson
Member

It's not our preference to depart from standards here. Just looking at that glyph, clearly somebody thinks it is double-width. As @stevengj said there is no clear standard.

@stevengj
Member

stevengj commented Apr 1, 2015

wcwidth is only "standard" in the sense that it is used by many programs; it is not consistent even between MacOS versions, much less across operating systems, and is invariably out of date.

Note that UAX#11 provides a clear standard for a subset of Unicode, and wcwidth as of MacOS 10.10.2 does not conform to Unicode 7 in the sense that it reports -1 (not printable / not recognized) for many of the characters listed in UAX#11 as having width 1 or 2 (narrow/wide).

@jiahao
Member

jiahao commented Apr 2, 2015

@gnachman The rationale and details of the analyses used to justify Julia's implementation are explained in JuliaStrings/utf8proc#2 and JuliaStrings/utf8proc#27 and in this notebook, which amongst other things details the exact discrepancies between my system wcwidth and the analysis. In all the cases I examined, I could not find a reason to justify the system answer over the analysis outlined in the issues and notebook.

As @JeffBezanson and @stevengj have already stated, there is no standard governing character widths, and so it is not possible to characterize Julia as "departing from the standard". On the contrary, it appears that not enough thought has gone into any other implementation for the purpose of determining character widths.

To illustrate our reasoning, consider the pizza character U+1F355. The relevant entry in EastAsianWidth.txt is:

1F330..1F37D;N # So [78] CHESTNUT..FORK AND KNIFE WITH PLATE

which assigns it the "neutral" category (not "narrow", which is coded as "Na"). Thus it falls into the nebulous category where UAX 11 has essentially nothing to say because the character doesn't exist in legacy East Asian encodings. (UAX 11 even says in its Scope not to consider it an authoritative source on character widths, but rather that

"The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.")

In the absence of a clear standard, the best I could come up with is to look at a font that actually bothered to provide a glyph for that code point, hence settling on Unifont, which provides this glyph:

uni01f3

Note that the character width assigned by inspecting the advance width from Unifont agrees with the eyeball comparison of the reference glyph in the Unicode character charts (pdf).

screen shot 2015-04-01 at 11 39 54 pm

Superimposed for reference is a square box. I do not see any reason why this should be 'narrow' instead of 'fullwidth'.
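
For readers who want to check the EastAsianWidth.txt entry quoted above themselves, here is a minimal C sketch that parses one data line of that file. The names `eaw_entry` and `parse_eaw_line` are hypothetical, and the sketch assumes the `first..last;Category` line layout shown in this comment.

```c
#include <stdio.h>

/* Parsed form of one data line of EastAsianWidth.txt. */
struct eaw_entry {
    unsigned long first;  /* first code point in the range */
    unsigned long last;   /* last code point (== first for single entries) */
    char category[3];     /* "N", "Na", "W", "F", "H" or "A" */
};

/* Parse "XXXX..YYYY;Cat ..." or "XXXX;Cat ...". Returns 1 on success and
 * 0 for comment, blank, or malformed lines (out may be partly written). */
int parse_eaw_line(const char *line, struct eaw_entry *out) {
    if (sscanf(line, "%lx..%lx ; %2[A-Za-z]",
               &out->first, &out->last, out->category) == 3)
        return 1;
    if (sscanf(line, "%lx ; %2[A-Za-z]", &out->first, out->category) == 2) {
        out->last = out->first;
        return 1;
    }
    return 0;
}
```

Fed the line quoted above, this yields the range 1F330 to 1F37D with category "N", and U+1F355 falls inside that range.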

@gnachman

gnachman commented Apr 2, 2015

@jiahao, I wasn't criticizing your work. The informal agreement between client and server, which as you note is underspecified, is what is rickety. Your work is really valuable--I wish it (or something like it) were widely adopted.

I had believed that EastAsianWidth.txt was "the standard", but I'm persuaded that there isn't really one at all. AFAIK most apps treat N as narrow, but it leads to the problems described in this bug.
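
The "treat N as narrow" convention mentioned here can be written as a one-line mapping from East_Asian_Width category to column count. The C sketch below illustrates that common practice under the stated assumption; it is not the implementation of any particular terminal or library, and `columns_for_category` is a hypothetical name.

```c
#include <string.h>

/* Conventional column count for a UAX #11 East_Asian_Width category:
 * W (wide) and F (fullwidth) take 2 columns; everything else, including
 * N (neutral) and A (ambiguous), is treated as 1 column here. */
int columns_for_category(const char *category) {
    if (strcmp(category, "W") == 0 || strcmp(category, "F") == 0)
        return 2;
    return 1;
}
```

Under this mapping the pizza character, listed as N, gets a width of 1, which is the mismatch with its visibly wide glyph that this issue is about.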

@timholy
Member

timholy commented Apr 2, 2015

It sounds like this should be reported upstream/more widely, if it hasn't been already.

@stevengj
Member

stevengj commented Apr 2, 2015

Unfortunately, it seems like the only upstream that can really fix this is libc, in order to fix wcwidth. I don't know where to file this kind of low-level bug report with Apple (??), and Microsoft is hopeless because of their wchar_t size, but it would be worthwhile for someone to check utf8proc against the latest GNU libc and file a bug report for discrepancies where libc is clearly wrong.
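
On the wchar_t size point: where wchar_t is 16 bits wide, a code point such as U+1F355 does not fit in a single wchar_t and has to be split into a UTF-16 surrogate pair, so it cannot be passed to wcwidth as one value. A small C sketch of that split follows; `to_utf16` is a hypothetical helper and assumes a valid, non-surrogate code point no greater than U+10FFFF.

```c
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2). */
int to_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F355 this produces the two units D83C and DF55.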

@jiahao
Member

jiahao commented Apr 2, 2015

glibc#4335

@vtjnash
Member

vtjnash commented Mar 14, 2016

Julia has the right (most up-to-date) char widths, so it's up to users to demand that their terminal emulators display them properly. Most likely that'll happen gradually, with various companies (#7267) lagging more or less behind the standards committee.

@Keno
Member Author

Keno commented Jun 24, 2016

Britain may be leaving the EU, but Unicode 9 came out and fixed this issue for us, so overall it's a pretty good day:
screen shot 2016-06-24 at 12 29 30 am

@Keno removed the "won't change" label Jun 24, 2016
@Keno
Member Author

Keno commented Jun 24, 2016

iTerm2 PR: gnachman/iTerm2#294
