collect(graphemes("🤦🏼‍♂️")) results in ERROR: ArgumentError: destination has fewer elements than required
#37680
Comments
I can reproduce this on master.
I think the problem is that the
diff --git a/base/strings/unicode.jl b/base/strings/unicode.jl
index 235f85184d..029787b3c1 100644
--- a/base/strings/unicode.jl
+++ b/base/strings/unicode.jl
@@ -671,7 +671,7 @@ function length(g::GraphemeIterator{S}) where {S}
     return n
 end

-function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
+function iterate(g::GraphemeIterator{S}, i_=(Int32(0),firstindex(g.s))) where {S}
     s = g.s
     statei, i = i_
     state = Ref{Int32}(statei)
@@ -679,6 +679,7 @@ function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
     y = iterate(s, i)
     y === nothing && return nothing
     c0, k = y
+    i == firstindex(g.s) && isgraphemebreak!(state, eltype(S)(0x00000000), c0)
     while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
         c, ℓ = iterate(s, k)
         isgraphemebreak!(state, c0, c) && break
I don't think that should be required. If it is, that potentially seems like a bug in utf8proc.
No, it seems to be documented: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640. Apparently we don't need to pass the state around, though; we can just always set it to zero initially.
I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required, though, that's a bug in utf8proc, which should reset the state by itself.
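To make "set the state to zero initially" concrete, here is a minimal sketch that counts grapheme clusters by walking adjacent character pairs with a freshly zeroed state. It uses the internal, unexported `Base.Unicode.isgraphemebreak!`; the helper name `count_graphemes` is made up for this illustration and is not part of Base or of any patch in this thread.

```julia
# Hypothetical helper: count grapheme clusters by testing each adjacent
# pair of characters, starting from a zeroed state at the start of the text.
function count_graphemes(s::AbstractString)
    state = Ref{Int32}(0)      # fresh state; nothing carried over from earlier text
    n = 0
    prev = nothing
    for c in s
        # The first character always starts a cluster; later characters start
        # one only when utf8proc reports a break between `prev` and `c`.
        if prev === nothing || Base.Unicode.isgraphemebreak!(state, prev, c)
            n += 1
        end
        prev = c
    end
    return n
end
```

For example, `count_graphemes("abc")` counts three clusters, while `count_graphemes("e\u0301x")` counts two because the combining accent attaches to the `e`. What it reports for the emoji strings in this issue depends on the utf8proc version bundled with your Julia.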
I just thought this was intended because that's how it's done in
It's definitely not supposed to be needed because
Thanks for the comment @Keno, and @simeonschaub for looking into this! I am going through the source of
UAX#29 defines the grapheme cluster boundary rules in this section. Specifically, there's this rule: http://www.unicode.org/reports/tr29/#GB11
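As a rough illustration of how GB11 (together with GB9) keeps the failing string in one cluster, the sketch below applies just those two rules to the string's five code points. The break classes are assigned by hand for this example rather than queried from utf8proc, and the names `CLUSTER` and `breaks_before` are hypothetical; this is a simplified model, not the utf8proc state machine.

```julia
# Code points of the failing string with hand-assigned break classes
# (an assumption made for illustration, not looked up from utf8proc).
const CLUSTER = [
    ('\U1F926', :ExtPict),  # person facepalming
    ('\U1F3FC', :Extend),   # skin tone modifier
    ('\u200D',  :ZWJ),      # zero width joiner
    ('\u2642',  :ExtPict),  # male sign
    ('\uFE0F',  :Extend),   # variation selector-16
]

# Is a boundary allowed before position i (i >= 2)? Only GB9 and GB11 are
# modeled; every other pair is treated as a break (GB999).
function breaks_before(classes::Vector{Symbol}, i::Int)
    cur = classes[i]
    cur in (:Extend, :ZWJ) && return false            # GB9: never break before Extend or ZWJ
    if cur === :ExtPict && classes[i - 1] === :ZWJ    # GB11: ExtPict Extend* ZWJ x ExtPict
        j = i - 2
        while j >= 1 && classes[j] === :Extend        # skip the optional Extend* run
            j -= 1
        end
        j >= 1 && classes[j] === :ExtPict && return false
    end
    return true
end

classes = Symbol[cls for (_, cls) in CLUSTER]
boundaries = [i for i in 2:length(classes) if breaks_before(classes, i)]
```

Under these two rules no position inside the sequence admits a boundary, so `boundaries` comes out empty and the whole sequence is a single grapheme cluster.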
These lines in
My understanding is that this is implemented as a state machine. I think it is this part of the "state machine" that is causing the issue.
I did some print style debugging to figure out what is going on. Here is the state machine transition in the following format:
You can see that when the boundclass is
I made the following change
And now the state machine goes from
I ran
Edit: Ignore the text below in this comment. See next comment for more information.
Specifically, there's this rule that says:
I don't understand why the specification says "break at the start" or what "unless the text is empty" means. What "text" is being referred to here? This is the implementation for this rule:
This returns
I changed that line to the following:
This is obviously a violation of the specification and with this change
But with this change and with the fix to the state machine that I mentioned above, I'm getting the behavior I expect. This also seems to work for every other case I experimented with. I didn't run a full test suite, though; this is my first foray into the Unicode specification, and I've only read this section. Any suggestions here would be very welcome.
I took a second stab at this just to explore again what is going on. It turns out I had a bug in my implementation! (I had re-implemented the state machine in Julia for my testing and messed up one line there.) This is the only change I needed to make to resolve this issue, exactly the same as the previous comment.
Ignore the part about the
The only question I have is what other states should transition to
Using fix proposed here JuliaLang/julia#37680
I've been experimenting with this to solve a similar problem, and it seems that all of these cases are only broken if the emoji cluster appears at the beginning of the string. For example:

julia> collect(graphemes("🤦🏼‍♂️"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[3]:1

julia> collect(graphemes("a" * "🤦🏼‍♂️"))
2-element Array{SubString{String},1}:
 "a"
 "🤦🏼\u200d♂️"

julia> collect(graphemes("🇸🇪🇸🇪"))
2-element Array{SubString{String},1}:
 "🇸🇪🇸"
 "🇪"

julia> collect(graphemes("a"*"🇸🇪🇸🇪"))
3-element Array{SubString{String},1}:
 "a"
 "🇸🇪"
 "🇸🇪"

julia> collect(graphemes("👨🏻‍🤝‍👨🏽"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[4]:1

julia> collect(graphemes("a"*"👨🏻‍🤝‍👨🏽"))
2-element Array{SubString{String},1}:
 "a"
 "👨🏻\u200d🤝\u200d👨🏽"
Thanks for looking into it! Does it fail even with the patch?
The state machine probably enters this incorrect state only in some cases.
Just did some more tests. The patch fixes "🤦🏼‍♂️" and "👨🏻‍🤝‍👨🏽" but not "🇸🇪🇸🇪", while prepending a letter fixes all three. It must have something to do with the initialization of the state.
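A small harness along these lines can make such comparisons repeatable. It is only a sketch: `grapheme_counts` and `CASES` are names invented here, and the strings are written with escapes so the code points are unambiguous. The idea is that `collect` needs `length` and `iterate` to agree, so the harness reports both numbers, with and without a leading letter, without ever calling `collect`.

```julia
using Unicode

# Hypothetical helper: compare the length reported by the grapheme iterator
# with the number of items it actually yields.
function grapheme_counts(s::AbstractString)
    g = graphemes(s)
    declared = length(g)
    yielded = 0
    for _ in g
        yielded += 1
    end
    return (declared = declared, yielded = yielded)
end

# The three strings discussed in this thread.
const CASES = [
    "\U1F926\U1F3FC\u200D\u2642\uFE0F",                 # facepalm, skin tone, ZWJ, male sign, VS16
    "\U1F1F8\U1F1EA\U1F1F8\U1F1EA",                     # two flags (four regional indicators)
    "\U1F468\U1F3FB\u200D\U1F91D\u200D\U1F468\U1F3FD",  # men holding hands, two skin tones
]

for s in CASES, prefix in ("", "a")
    println(repr(prefix * s), " => ", grapheme_counts(prefix * s))
end
```

Any row where `declared` and `yielded` differ is a string on which `collect(graphemes(...))` would fail; the actual numbers depend on the Julia and utf8proc versions in use.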
I have a patch for this that fixes the case of two (or more) flags
This patch special-cases a regional indicator at the start.
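For reference, the behavior the flag case should have follows from rules GB12 and GB13 of UAX#29: regional indicators pair up from the start of the run, with a break allowed only after each complete pair. The sketch below illustrates that pairing on a string made up solely of regional indicators. It is not the patch mentioned above, and `is_regional_indicator` and `pair_regional_indicators` are hypothetical names.

```julia
# Regional indicator symbols occupy U+1F1E6 through U+1F1FF.
is_regional_indicator(c::AbstractChar) = '\U1F1E6' <= c <= '\U1F1FF'

# Hypothetical helper: split a run of regional indicators into flag pairs,
# counting from the start of the run (GB12/GB13). A trailing unpaired
# indicator forms a cluster of its own.
function pair_regional_indicators(s::AbstractString)
    chars = collect(s)
    all(is_regional_indicator, chars) ||
        throw(ArgumentError("expected only regional indicator symbols"))
    return [join(chars[i:min(i + 1, end)]) for i in 1:2:length(chars)]
end

flags = pair_regional_indicators("\U1F1F8\U1F1EA\U1F1F8\U1F1EA")
```

Here `flags` holds two two-character strings, which matches the result reported earlier in the thread for the same flags prefixed with a letter.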
Oh sick. I can add that to my fork and submit a PR. I've got some additional tests added to the graphemetest.c file.
Thanks for pushing it along!
Update: the underlying bug here was that the initial
Once the tests on the latest utf8proc PR are green, I'll merge and then post a PR to make Julia use the updated utf8proc.
I believe there's an issue with the implementation of iterate on the result of graphemes("🤦🏼‍♂️").
The following works as expected:
However, this is the error I get when I try to collect the result of the iterator in the latest stable release of Julia (v1.5.1).
The implementation of length appears to be correct, but I think there's a bug in the implementation of iterate:
I believe the correct behavior in this case is to return just the first element. iterate seems to be returning the symbol ♂ as well.