-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add isemoji function to Unicode stdlib and export it #38458
Conversation
I'm not sure why the tests are failing here. Looking at the buildbot results, it doesn't seem to have anything to do with code I introduced? The windows versions fail on a different test, and the linux and mac versions pass all tests but error at a later point due to something involving python (i think?). Anyone know what's going on? |
Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
Handle empty strings Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
stdlib/Unicode/src/Unicode.jl
Outdated
0x02600:0x026FF, # Misc symbols | ||
0x02700:0x027BF, # Dingbats | ||
0x0FE00:0x0FE0F, # Variation Selectors | ||
0x1F900:0x1F9FF, # Supplemental Symbols and Pictographs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be generated from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt somehow? Or at least have a test that checks that everything from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt is in this set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. I'll get to work on this. Is it ok to include a local copy of this file in the /Unicode/test/ directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just download it during the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can but it may make CI flaky. Let's try that and see how it goes. If it's flaky we may need to cache a copy but that can be dealt with separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented y'all's suggestions, so now we download https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt from the unicode website and use that to generate a const containing emoji ranges, which are then tested against. In the tests, we download the https://unicode.org/Public/emoji/13.1/emoji-sequences.txt and https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt files, which we use to build a list of all valid emoji, which we then test the isemoji function against.
Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
The previous range list turned out not to be exhaustive. Now, we read the single emoji ranges directly from the unicode consortium website. Additionally added support for emojis that contain variation selectors ('\uFE0F')
Something's broken now and I'm not sure why. |
stdlib/Unicode/src/Unicode.jl
Outdated
@@ -89,4 +89,86 @@ letter combined with an accent mark is a single grapheme.) | |||
""" | |||
graphemes(s::AbstractString) = Base.Unicode.GraphemeIterator{typeof(s)}(s) | |||
|
|||
const emoji_data = download("https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we want to store this entire string in the compiled library. You should just download it when you parse the data, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your old code that just looked at certain code point blocks was a lot more compact and independent of the Unicode version. I’m just not sure if it is standard conforming?
Could we just use the old code combined with checking the category code to see if the code point is assigned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I still think we should use the data file, but only for tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i looked more into the emoji codepoints, i found some corner cases which the old version didn't catch. It would be possible to patch those, but the secondary issue was that there's a bunch of unassigned emoji in the blocks in the old version. If we're ok with saying "yes this is an emoji" to characters in those larger blocks which are not (yet) emoji, then we can go with the old system. If not, we would either need to restrict those ranges to those codepoints which are currently assigned manually or via parsing the emoji_data file like i do here.
i do agree the old version was a lot simpler, so i'm not sure what the best solution is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should return false
for currently unassigned codepoints, but we can check for that simply by returning false
if Unicode.category_code(char) == Unicode.UTF8PROC_CATEGORY_CN
(Or even be more restrictive: only allow category So or Sk.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh neat, didn't know about that. I'll get back with a new version later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looking at this, there are things in the emoji data file which fall under unicode category Sm and Po (◾ and ‼, respectively), and things in So which do not qualify as emoji (Ⓕ). To be honest, I'm not sure why some of these symbols get called emoji and some don't, but I think the only way to be complete here is to use the full list of ranges. We don't have to download it necessarily (could take the output from the file and manually include that in the unicode.jl file, but that's a 702-element array), but I'm not sure of another way to make sure we catch all emoji without false positives
What is the |
Another question: suppose we had a good definition of a single character being an emoji; given, what is the utility of |
After some thought and retooling yesterday, I agree. The utility most makes sense in identifying a single emoji, not whether a string is all emoji. I have a new version I'll upload in a bit that does that well and catches all the emoji in the dataset. We could separately have a The remaining question is how do we want do the check for single characters? the most effective way is to parse emoji_data.txt, but do we want to download it? have a cached copy? a pre-made list of ranges in a separate file? i think the last would be most space efficient and would obviate the need for emoji file parsing utilities in the main body of the Unicode module and let them live in the tests instead |
The problem with only having |
I have a version (to be uploaded shortly) that has isemoji as a character predicate and a string predicate, so that we get this behavior julia> teststring = "My family looks like this: 👩👩👦👦 and I ❤️ them"
julia> isemoji(teststring)
false
julia> filter(isemoji, [g for g in graphemes(teststring)])
2-element Array{SubString{String},1}:
"👩\u200d👩\u200d👦\u200d👦"
"❤️"
julia> isemoji("😈😘")
false
julia> isemoji('👦')
true
julie> isemoji("")
false This seems to me the most intuitive way to address the question. |
Ok, I've uploaded my new version. I'm not sure if the julia> [g for g in graphemes("👨🏻🤝👨🏽")]
1-element Array{SubString{String},1}:
"👨🏻\u200d"
julia> [c for c in "👨🏻🤝👨🏽"]
7-element Array{Char,1}:
'👨': Unicode U+1F468 (category So: Symbol, other)
'🏻': Unicode U+1F3FB (category Sk: Symbol, modifier)
'\u200d': Unicode U+200D (category Cf: Other, format)
'🤝': Unicode U+1F91D (category So: Symbol, other)
'\u200d': Unicode U+200D (category Cf: Other, format)
'👨': Unicode U+1F468 (category So: Symbol, other)
'🏽': Unicode U+1F3FD (category Sk: Symbol, modifier) The python graphemes package does catch this one correctly >>> import grapheme
>>> str = "👨🏻🤝👨🏽"
>>> list(grapheme.graphemes(str))
['👨🏻\u200d🤝\u200d👨🏽'] EDIT: Double regional indicators seem not to work either: julia> [g for g in graphemes("🇸🇪🇸🇪")]
2-element Array{SubString{String},1}:
"🇸🇪🇸"
"🇪" |
The problem with the graphemes seems to be related to #37680, and stems from a bug in the utf8-proc library. |
I think the new behavior makes sense but now I worry about the name: "emoji" is both plural and singular, so it might be surprising to someone that this function considers a string that consists of two emoji to not be emoji. So perhaps this predicate should be called |
My basic concern here is about adding a nontrivial new function to the stdlib without clear explanation of what it is for. What is the application of this function? |
I think the application is counting emoji in text. You would do it by calling |
Yeah my main application for this has been parsing and analyzing message data to determine things like emoji frequencies, usage patterns etc. I have also been thinking a package might be easier for this at least for now. |
Closing, per apparent consensus among participants that this should be a separate package. |
Resolves #38063