Add isemoji function to Unicode stdlib and export it #38458

archermarx · 2020-11-16T16:03:50Z

Resolves #38063

archermarx · 2020-11-16T18:28:43Z

I'm not sure why the tests are failing here. Looking at the buildbot results, the failures don't seem to have anything to do with the code I introduced. The Windows builds fail on a different test, and the Linux and macOS builds pass all tests but error at a later point due to something involving Python (I think?). Does anyone know what's going on?

stevengj · 2020-11-16T20:29:29Z

stdlib/Unicode/src/Unicode.jl

+    0x02600:0x026FF,  # Misc symbols
+    0x02700:0x027BF,  # Dingbats
+    0x0FE00:0x0FE0F,  # Variation Selectors
+    0x1F900:0x1F9FF,   # Supplemental Symbols and Pictographs


Should this be generated from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt somehow? Or at least have a test that checks that everything from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt is in this set?

That's a good idea. I'll get to work on this. Is it ok to include a local copy of this file in the /Unicode/test/ directory?

Can't we just download it during the test?

We can but it may make CI flaky. Let's try that and see how it goes. If it's flaky we may need to cache a copy but that can be dealt with separately.

I implemented y'all's suggestions, so now we download https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt from the Unicode website and use it to generate a const containing the emoji ranges that isemoji checks against. In the tests, we download the https://unicode.org/Public/emoji/13.1/emoji-sequences.txt and https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt files, use them to build a list of all valid emoji, and test the isemoji function against that list.
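As a rough illustration of the range-generation step described above, here is a minimal sketch that parses text in the `codepoints ; property # comment` layout used by emoji-data.txt. The function name `parse_emoji_ranges` and the embedded sample are assumptions made for illustration; this is not the code from the PR, and it reads a local string rather than downloading anything.

```julia
# Abbreviated sample in the layout of emoji-data.txt (not the real file).
sample = """
# comment line
0023          ; Emoji                # number sign
0030..0039    ; Emoji                # digit zero..digit nine
1F600         ; Emoji                # grinning face
1F601..1F606  ; Emoji_Presentation   # different property, skipped below
"""

# Hypothetical helper: collect the code point ranges that carry `property`.
function parse_emoji_ranges(text::AbstractString; property::AbstractString = "Emoji")
    ranges = UnitRange{UInt32}[]
    for line in eachline(IOBuffer(text))
        body = strip(first(split(line, '#')))   # drop the trailing comment
        isempty(body) && continue
        fields = split(body, ';')
        length(fields) == 2 || continue
        strip(fields[2]) == property || continue
        bounds = split(strip(fields[1]), "..")   # "A..B" or a single "A"
        lo = parse(UInt32, bounds[1]; base = 16)
        hi = parse(UInt32, bounds[end]; base = 16)
        push!(ranges, lo:hi)
    end
    return ranges
end

ranges = parse_emoji_ranges(sample)
```

A test could then check that every code point listed in the sequence files falls inside one of `ranges`.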

The previous range list turned out not to be exhaustive. Now, we read the single emoji ranges directly from the unicode consortium website. Additionally added support for emojis that contain variation selectors ('\uFE0F')

archermarx · 2020-11-17T00:11:44Z

Something's broken now and I'm not sure why.

stevengj · 2020-11-17T01:28:21Z

stdlib/Unicode/src/Unicode.jl

@@ -89,4 +89,86 @@ letter combined with an accent mark is a single grapheme.)
 """
 graphemes(s::AbstractString) = Base.Unicode.GraphemeIterator{typeof(s)}(s)

+const emoji_data = download("https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt")


I don’t think we want to store this entire string in the compiled library. You should just download it when you parse the data, maybe?

Yeah, that makes sense.

Your old code that just looked at certain code point blocks was a lot more compact and independent of the Unicode version. I’m just not sure if it is standard conforming?

Could we just use the old code combined with checking the category code to see if the code point is assigned?

(I still think we should use the data file, but only for tests.)

As I looked more into the emoji codepoints, I found some corner cases which the old version didn't catch. It would be possible to patch those, but the secondary issue was that there are a bunch of unassigned codepoints in the blocks used by the old version. If we're OK with saying "yes, this is an emoji" to characters in those larger blocks which are not (yet) emoji, then we can go with the old system. If not, we would need to restrict those ranges to the codepoints which are currently assigned, either manually or by parsing the emoji_data file like I do here.

I do agree the old version was a lot simpler, so I'm not sure what the best solution is.

I agree that we should return false for currently unassigned codepoints, but we can check for that simply by returning false if Unicode.category_code(char) == Unicode.UTF8PROC_CATEGORY_CN

(Or even be more restrictive: only allow category So or Sk.)
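The suggestion above (block ranges plus a category check) might look roughly like the following sketch. The helper name `in_assigned_block` is hypothetical, the block list is the one quoted from the diff earlier in this thread, and `category_code` and `UTF8PROC_CATEGORY_CN` are unexported names from `Base.Unicode`, so treat their use here as an assumption about internals rather than public API.

```julia
# Block ranges quoted from the diff earlier in this thread.
emoji_blocks = [
    0x02600:0x026FF,  # Misc symbols
    0x02700:0x027BF,  # Dingbats
    0x0FE00:0x0FE0F,  # Variation Selectors
    0x1F900:0x1F9FF,  # Supplemental Symbols and Pictographs
]

# Hypothetical helper: true if `c` lies in one of the blocks above and its
# general category is not Cn (unassigned).
function in_assigned_block(c::AbstractChar)
    isvalid(c) || return false
    cp = UInt32(c)
    any(r -> cp in r, emoji_blocks) || return false
    return Base.Unicode.category_code(c) != Base.Unicode.UTF8PROC_CATEGORY_CN
end

in_assigned_block('\u2600')   # a character inside the Misc symbols block
in_assigned_block('a')        # outside every listed block
```

Which code points count as assigned depends on the Unicode version bundled with the running Julia, so results for recently added code points can differ between releases.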

Oh neat, didn't know about that. I'll get back with a new version later today.

So looking at this, there are things in the emoji data file which fall under Unicode categories Sm and Po (◾ and ‼, respectively), and things in So which do not qualify as emoji (Ⓕ). To be honest, I'm not sure why some of these symbols get called emoji and some don't, but I think the only way to be complete here is to use the full list of ranges. We don't necessarily have to download it (we could take the output from the file and manually include it in the Unicode.jl file, but that's a 702-element array), but I'm not sure of another way to make sure we catch all emoji without false positives.
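The counterexamples above can be checked directly. This sketch looks up the general category of the three characters mentioned and applies a hypothetical "So or Sk only" filter (`passes_filter` is an illustrative name). `category_abbrev` is an unexported `Base.Unicode` function, so relying on it is an assumption about internals.

```julia
# The three characters discussed above, written as escapes to avoid ambiguity:
# U+25FE (◾), U+203C (‼), U+24BB (Ⓕ).
samples = ['\u25fe', '\u203c', '\u24bb']

# Map each character to its two-letter general category abbreviation.
cats = Dict(c => Base.Unicode.category_abbrev(c) for c in samples)

# Hypothetical filter that only admits categories So and Sk.
allowed = ("So", "Sk")
passes_filter(c::AbstractChar) = Base.Unicode.category_abbrev(c) in allowed

for c in samples
    println(c, " => ", cats[c], ", passes So/Sk filter: ", passes_filter(c))
end
```

Under such a filter, ◾ and ‼ would be rejected even though the data file lists them as emoji, while Ⓕ would be accepted even though it is not listed, which is the mismatch described above.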

stevengj · 2020-11-17T22:07:42Z

What is the isemoji function good for, especially if the set of "emoji" characters is a bit fuzzy and doesn't include all dingbats?

StefanKarpinski · 2020-11-18T17:15:14Z

Another question: suppose we had a good definition of a single character being an emoji; given that, what is the utility of isemoji on a string? Why would one want to know that a string consists only of emoji? Wouldn't it be more useful to ask whether any of the characters in a string are emoji? The zero-width joiner issue complicates things, but it's a little unclear to me what the right predicate should be. Most questions you could ask about a string could presumably be answered by calling graphemes and then applying a predicate to individual graphemes. Maybe that's why this PR's predicate makes sense: it's only intended to be applied to single-grapheme strings?

archermarx · 2020-11-18T17:46:05Z

After some thought and retooling yesterday, I agree. The utility makes the most sense in identifying a single emoji, not whether a string is all emoji. I have a new version I'll upload in a bit that does that well and catches all the emoji in the dataset. We could separately have a contains_emoji function. My use case for this has been parsing text and compiling statistics of emoji usage in large corpuses of internet conversation, so I think splitting out individual emoji makes far more sense than just saying "everything in this string is an emoji".

The remaining question is how we want to do the check for single characters. The most effective way is to parse emoji-data.txt, but do we want to download it? Have a cached copy? A pre-made list of ranges in a separate file? I think the last would be the most space-efficient and would obviate the need for emoji file parsing utilities in the main body of the Unicode module, letting them live in the tests instead.
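For the "pre-made list of ranges" option, a character lookup could be a binary search over sorted, non-overlapping ranges. The sketch below assumes such a table; `emoji_ranges` here is a tiny hypothetical stand-in for the full list, and `in_ranges` is an illustrative name rather than anything from the PR.

```julia
# Hypothetical, abbreviated table; a real one would hold every range from the
# data file, sorted by starting code point and non-overlapping.
emoji_ranges = [
    0x00000023:0x00000023,
    0x00000030:0x00000039,
    0x0001F600:0x0001F64F,
]
range_starts = first.(emoji_ranges)

# Binary search for the last range that starts at or before the code point,
# then check that the code point does not lie past that range's end.
function in_ranges(c::AbstractChar)
    isvalid(c) || return false
    cp = UInt32(c)
    i = searchsortedlast(range_starts, cp)
    return i > 0 && cp <= last(emoji_ranges[i])
end

in_ranges('\U1F600')
in_ranges('a')
```

With a table of a few hundred ranges this keeps each lookup to a handful of comparisons, and the parsing code that generates the table can stay out of the module itself.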

StefanKarpinski · 2020-11-18T18:54:51Z

The problem with only having isemoji as a character predicate would seem to be that some emoji are not a single character but a whole grapheme, so if you want to count them you probably want to count such a sequence as a single emoji, right? One way to do this would be to call graphemes to iterate the graphemes in a string and then have a predicate that can be applied to each grapheme string to indicate whether it is an emoji grapheme or not. Of course, that's not the only way that predicate can be called, so the question is what is the most useful generalization from a predicate like that to arbitrary strings?
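The grapheme-based counting described above could be sketched as follows. The predicate `is_emoji_grapheme` and its `toy_ranges` are hypothetical stand-ins (the ranges are illustrative, not a complete emoji table); only `graphemes` comes from the Unicode stdlib.

```julia
using Unicode: graphemes

# Illustrative ranges only; a real predicate would use the full emoji data.
toy_ranges = (0x2600:0x27BF, 0x1F300:0x1FAFF)

# Hypothetical predicate: a grapheme counts as an emoji if any of its code
# points falls in one of the ranges above.
is_emoji_grapheme(g::AbstractString) =
    any(c -> isvalid(c) && any(r -> UInt32(c) in r, toy_ranges), g)

# Count graphemes, not characters, so a multi-code-point emoji counts once.
count_emoji(s::AbstractString) = count(is_emoji_grapheme, graphemes(s))

count_emoji("I \u2764 Julia \U1F600")
```

How ZWJ sequences are segmented depends on the utf8proc version shipped with Julia, so counts for such sequences may vary between releases.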

archermarx · 2020-11-18T19:07:20Z

I have a version (to be uploaded shortly) that has isemoji as both a character predicate and a string predicate, so that we get this behavior:

julia> teststring = "My family looks like this: 👩‍👩‍👦‍👦 and I ❤️ them"

julia> isemoji(teststring)
false

julia> filter(isemoji, [g for g in graphemes(teststring)])
2-element Array{SubString{String},1}:
 "👩\u200d👩\u200d👦\u200d👦"
 "❤️"

julia> isemoji("😈😘")
false

julia> isemoji('👦')
true

julia> isemoji("")
false

This seems to me the most intuitive way to address the question.
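One way to get the string behavior shown in the transcript above (false for the empty string and for anything longer than one grapheme) is sketched below. `is_single_emoji` and the toy ranges are assumptions for illustration and are not the implementation uploaded to the PR.

```julia
using Unicode: graphemes

# Illustrative ranges only; not a complete emoji table.
toy_ranges = (0x2600:0x27BF, 0x1F300:0x1FAFF)
has_emoji_codepoint(s::AbstractString) =
    any(c -> isvalid(c) && any(r -> UInt32(c) in r, toy_ranges), s)

# Hypothetical string predicate: true only for exactly one grapheme that
# contains an emoji code point. Empty strings have zero graphemes.
function is_single_emoji(s::AbstractString)
    length(graphemes(s)) == 1 || return false
    return has_emoji_codepoint(s)
end

# Character method, so the predicate also works on a single Char.
is_single_emoji(c::AbstractChar) = is_single_emoji(string(c))

is_single_emoji("\U1F608\U1F618")   # two emoji, two graphemes
is_single_emoji('\U1F466')          # one emoji character
```

Because the method for strings requires exactly one grapheme, it composes with `filter` over `graphemes(s)` in the way the transcript shows.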

archermarx · 2020-11-18T19:56:53Z

Ok, I've uploaded my new version. I'm not sure if the graphemes function is working wholly correctly. For instance,

julia> [g for g in graphemes("👨🏻‍🤝‍👨🏽")]
1-element Array{SubString{String},1}:
 "👨🏻\u200d"

julia> [c for c in "👨🏻‍🤝‍👨🏽"]
7-element Array{Char,1}:
 '👨': Unicode U+1F468 (category So: Symbol, other)
 '🏻': Unicode U+1F3FB (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '🤝': Unicode U+1F91D (category So: Symbol, other)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '👨': Unicode U+1F468 (category So: Symbol, other)
 '🏽': Unicode U+1F3FD (category Sk: Symbol, modifier)

The Python grapheme package does catch this one correctly:

>>> import grapheme
>>> str = "👨🏻‍🤝‍👨🏽"
>>> list(grapheme.graphemes(str))
['👨🏻\u200d🤝\u200d👨🏽']

EDIT:

Double regional indicators seem not to work either:

julia> [g for g in graphemes("🇸🇪🇸🇪")]
2-element Array{SubString{String},1}:
 "🇸🇪🇸"
 "🇪"

archermarx · 2020-11-19T00:51:17Z

The problem with graphemes seems to be related to #37680, and stems from a bug in the utf8proc library.

StefanKarpinski · 2020-11-30T17:18:46Z

I think the new behavior makes sense, but now I worry about the name: "emoji" is both plural and singular, so it might be surprising to someone that this function considers a string that consists of two emoji to not be emoji. So perhaps this predicate should be called isoneemoji? Or perhaps it would be better to provide some kind of find-like interface instead, such as findemoji(s::AbstractString, i::Integer), which returns nothing if there are no emoji in the given string or a pair of indices if there are?
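A find-like interface along the lines suggested above might be sketched like this. The name `find_emoji`, the `start => stop` return shape, the restriction to `String`, and the toy ranges are all assumptions made for illustration; the thread does not specify these details.

```julia
using Unicode: graphemes

# Illustrative ranges only; not a complete emoji table.
toy_ranges = (0x2600:0x27BF, 0x1F300:0x1FAFF)
has_emoji_codepoint(g::AbstractString) =
    any(c -> isvalid(c) && any(r -> UInt32(c) in r, toy_ranges), g)

# Hypothetical search: returns `start => stop`, the indices of the first and
# last character of the first emoji grapheme that starts at or after index
# `i`, or `nothing` if there is none.
function find_emoji(s::String, i::Integer = firstindex(s))
    idx = firstindex(s)
    for g in graphemes(s)
        n = ncodeunits(g)
        if idx >= i && has_emoji_codepoint(g)
            return idx => prevind(s, idx + n)
        end
        idx += n   # advance to the start of the next grapheme
    end
    return nothing
end

s = "I \u2764 Julia \U1F600"
find_emoji(s)       # first emoji grapheme in the string
find_emoji(s, 4)    # search resumes after the first one
```

Returning indices rather than a Bool sidesteps the singular/plural naming question, since the caller can loop to find every emoji or stop at the first.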

stevengj · 2020-11-30T17:58:05Z

My basic concern here is about adding a nontrivial new function to the stdlib without a clear explanation of what it is for. What is the application of this function?

StefanKarpinski · 2020-11-30T20:34:28Z

I think the application is counting emoji in text. You would do it by calling graphemes and then calling isemoji on each grapheme. It seems like this might be better suited to a package which can evolve its API as necessary.

archermarx · 2020-12-01T15:42:01Z

Yeah, my main application for this has been parsing and analyzing message data to determine things like emoji frequencies, usage patterns, etc. I have also been thinking a package might be easier for this, at least for now.

vtjnash · 2021-04-16T01:29:38Z

Closing, per apparent consensus among participants that this should be a separate package.

archermarx added 2 commits November 16, 2020 11:02

Add isemoji function and export it

3a5c8d7

Add emoji tests

ca8f747

archermarx changed the title ~~Add isemoji function and export it~~ Add isemoji function to Unicode stdlib and export it Nov 16, 2020

Fix tests involving unexported zero-width joiner

8029e2d

archermarx marked this pull request as draft November 16, 2020 17:00

archermarx marked this pull request as ready for review November 16, 2020 18:27

StefanKarpinski requested changes Nov 16, 2020

stdlib/Unicode/src/Unicode.jl (outdated)

stdlib/Unicode/src/Unicode.jl (outdated)

archermarx and others added 2 commits November 16, 2020 15:25

Update stdlib/Unicode/src/Unicode.jl

ae86600

Co-authored-by: Stefan Karpinski <stefan@karpinski.org>

Update stdlib/Unicode/src/Unicode.jl

773799d

Handle empty strings Co-authored-by: Stefan Karpinski <stefan@karpinski.org>

StefanKarpinski requested changes Nov 16, 2020

stdlib/Unicode/test/runtests.jl

stevengj reviewed Nov 16, 2020

archermarx and others added 5 commits November 16, 2020 15:33

Update isemoji documentation

6154f0e

Update stdlib/Unicode/test/runtests.jl

8c6bd6c

Co-authored-by: Stefan Karpinski <stefan@karpinski.org>

Redo isemoji to fix missed corner cases

3bc23ce

The previous range list turned out not to be exhaustive. Now, we read the single emoji ranges directly from the unicode consortium website. Additionally added support for emojis that contain variation selectors ('\uFE0F')

Add new emoji tests

22c43ed

Update documentation

1382b12

stevengj reviewed Nov 17, 2020

Download when parsing emoji file

cd5092e

Update isemoji function and add emoji_ranges file

7176501

archermarx added 2 commits November 19, 2020 01:23

Remove incorrect emoji ranges

f80e7ab

Support two-character keycap pattern

b9b54a1

Fix test, but one test in precompile is failing

b0d0156

archermarx requested a review from StefanKarpinski November 20, 2020 16:58

StefanKarpinski marked this pull request as draft December 3, 2020 17:12

vtjnash mentioned this pull request Apr 16, 2021

Feature request: isemoji function #38063

Closed

vtjnash closed this Apr 16, 2021
